8 External Validity, Generalizability, and Transportability
In 1997, a clinical trial known as ACTG 320 delivered news that HIV/AIDS researchers and patients had been waiting for: a three-drug combination therapy reduced the risk of AIDS or death by 49% compared to standard treatment. The findings helped launch the antiretroviral era that would transform HIV from a death sentence into a manageable chronic condition.
Years later, researchers asked a different question: What would that treatment effect have looked like in the broader U.S. population of people living with HIV—a population that differed from trial participants in age, race, and disease severity? When they ran the analysis, the estimated benefit shrank to 43%. Same treatment. Same outcome. Different answer.
The later analysis illustrates something important: causal effects estimated in one population don’t automatically apply to another. Understanding when they do—and when they don’t—is the problem of external validity.
This chapter unpacks that problem. We’ll distinguish three related but distinct concepts—external validity, generalizability, and transportability—and develop a framework for thinking rigorously about when and why research findings travel.
8.1 External Validity: The Umbrella Concept
In Chapter 7, we focused on internal validity: Is the causal effect we estimated correct for the people we actually studied? External validity asks the next question: Does that effect apply to the population or context we care about?
External validity is always a question about a specific target population. You cannot say a study is “externally valid” in the abstract—only that it is externally valid for a particular population of interest.
While internal validity is specifically a causal inference concept, external validity concerns arise more broadly. A prevalence estimate, an association, a prediction model: each can be correct for the sample yet fail to apply elsewhere, whether to the target population the sample was drawn from or to an entirely new population. Whenever we extend findings beyond the local study that produced them, we face an external validity question.
GENERALIZABILITY AND TRANSPORTABILITY
External validity concerns take two forms, depending on the relationship between the study sample and the target population (Lash et al., 2021).
Generalizability applies when the study sample is entirely contained within the target population—when the sample is a subset of the target. A clinical trial conducted at Kenyan hospitals, aiming to inform treatment policy for all Kenyans, faces a generalizability question: Do the trial participants adequately represent the broader target population they were drawn from? A national survey faces the same question: Does the sample reflect the population it was designed to represent? Concerns here arise from selection, nonresponse, eligibility criteria, and attrition.
In contrast, transportability applies when the study sample is not contained within the target population—when you want to apply the results to a different population entirely. If we want to know whether the Kenyan trial’s results apply in Uganda, or whether a prevalence estimate from urban clinics applies to rural areas that were never sampled, we face a transportability question. The study population is external to the target population.
WHEN EXTERNAL VALIDITY IS LIMITED
External validity bias arises when an estimate produced in a study differs from the corresponding quantity in a specified target population because the conditions that generated the estimate do not hold in that population.
For descriptive findings like prevalence, this is relatively straightforward. If your sample over-represents certain groups, your estimate likely won’t match the target population. The solution is familiar: weight the sample to look like the population.
For causal effects, the problem is more specific. What matters is not whether the populations differ on any characteristic, but whether they differ on characteristics that modify the effect—variables for which the treatment works differently across levels.
A variable is an effect modifier if the causal effect of the treatment changes depending on that variable’s value. Age might modify the effect of a blood pressure drug if the drug lowers blood pressure more in younger patients than older ones. Disease severity might modify the effect of a behavioral intervention if it works well for mild cases but poorly for severe ones.
If a treatment works equally well in everyone, population differences don’t threaten external validity. Your trial could over-sample young people, wealthy people, or urban residents—and as long as the treatment effect is the same across those groups, your estimate still applies to your target population. But if the treatment works differently across groups, then differences between your sample and target population become consequential. Effect modification is why some population differences matter for external validity and others do not.
This logic applies to both generalizability and transportability. Whether you’re asking if a trial represents its source population or whether findings from one country apply in another, the key question is the same: Do the populations differ on characteristics that modify the estimate?
8.2 Generalizability: Do the Results of the Study Sample Apply to the Target Population?
The defining characteristic of generalizability is that everyone in the study sample is also a member of the target population.
One way to achieve generalizability is through representative sampling. If the sample is drawn randomly from the target population—or drawn using probability methods with known selection probabilities—then the sample should, on average, look like the population. This is the standard approach for prevalence studies and surveys. The DHS example below illustrates this logic: sample representatively, weight appropriately, and the estimate generalizes.
But trials almost never work this way. Randomized trials randomly assign treatment, but they rarely randomly sample participants from the target population. Instead, trials enroll volunteers who meet specific eligibility criteria, provide informed consent, and can adhere to the study protocol. This is not a design flaw—it reflects the realities of conducting ethical research. You cannot randomly select people and compel them to participate in a trial. Enrollment requires willingness, which is itself selective.
As a result, trial samples routinely differ from the populations they aim to inform. Participants tend to be younger, healthier, more educated, and more engaged with the health system than typical patients. Trials conducted at academic medical centers draw from different populations than community clinics. Strict eligibility criteria exclude patients with comorbidities, complex medication regimens, or unstable social circumstances—exactly the patients who may later receive the treatment in routine care.
This means generalizability for trials is not primarily a sampling problem—it’s an effect modification problem. The question is not “did we sample representatively?” (we almost certainly didn’t) but rather “do the characteristics on which our sample differs from the target population modify the treatment effect?” If they don’t, the estimate still applies. If they do, it may not.
ILLUSTRATING GENERALIZABILITY
The following examples show how generalizability works in practice—and where it breaks down. In each example, the study sample is drawn from the target population.
Representative sampling makes generalizability easier
The 2011 Nepal Demographic and Health Survey illustrates the clearest path to generalizability (Ministry of Health and Population et al., 2011). DHS researchers used multistage cluster sampling to survey 10,826 households across Nepal, interviewing 12,674 women ages 15–49. The sampling design was explicitly constructed to provide nationally representative estimates: the country was divided into 13 eco-development regions, each stratified by urban and rural areas, with sampling weights calculated to account for the complex design.
The target population was clearly defined (women of reproductive age in Nepal), and the study sample was designed to be a representative subset of it. The DHS could estimate that 50% of currently married women were using some method of contraception and that 27% had an unmet need for family planning. Because the sample was drawn from the target population using probability methods, and because weights adjust for known selection probabilities, these estimates generalize—they apply to Nepali women of reproductive age, not just to the women surveyed.
This is the gold standard for descriptive generalizability: design the sample to represent the target, use probability methods, and adjust for selection.
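To make the weighting step concrete, here is a minimal sketch of a design-based prevalence estimate. The responses and selection probabilities below are invented for illustration (they are not DHS microdata); a real analysis would use the survey's published weights and account for the stratified cluster design when estimating uncertainty.

```python
import numpy as np

# Hypothetical respondents: 1 = using a modern method, 0 = not.
# In a real DHS analysis these would come from the survey microdata.
using_modern_method = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

# Hypothetical selection probabilities under the multistage design.
# The sampling weight is the inverse of each respondent's probability
# of selection, so under-sampled strata receive larger weights.
selection_prob = np.array([0.002, 0.002, 0.004, 0.004, 0.001,
                           0.001, 0.003, 0.003, 0.002, 0.002])
weights = 1.0 / selection_prob

# Unweighted estimate: describes the sample, not the population.
unweighted = using_modern_method.mean()

# Weighted estimate: each respondent "stands in" for the population
# members she represents, so the estimate reflects the target population.
weighted = np.sum(weights * using_modern_method) / np.sum(weights)

print(f"Unweighted prevalence: {unweighted:.1%}")
print(f"Weighted prevalence:   {weighted:.1%}")
```

The point estimate itself is a one-line computation; the hard part in practice is the design (strata, clusters, selection probabilities) and the variance estimation, which is why analysts rely on the weights and design variables the DHS publishes with its data.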
Generalizability is possible with non-probability samples
But what if you don’t have a probability sample? Many studies rely on convenience samples—patients at a clinic, respondents to an online survey, or participants recruited through social networks. These samples are not designed to represent a target population, but generalizability may still be achievable if you can adjust for the differences.
Consider a hypothetical example. Suppose a research team wants to estimate modern contraceptive use among women ages 15–49 in Nepal before the next DHS survey is scheduled. But instead of a costly household survey, they run an online survey promoted through social media and health-related websites. Within weeks, they collect responses from 5,000 women across the country.
The convenience is obvious; so is the problem. Women who respond to an online survey in Nepal are not representative of all Nepali women. They are overwhelmingly urban, younger, more educated, wealthier, and more likely to have smartphone access. The raw estimate of modern contraceptive use from this sample would almost certainly overstate use in the national population.
One approach to address this is multilevel regression with poststratification (MRP)—a method that combines statistical modeling with population data to adjust non-representative samples (Alexander, 2023). The logic works in two steps. First, researchers fit a model that predicts the outcome (contraceptive use) based on respondent characteristics available in both the online survey and a population reference, such as census data or a nationally representative survey. This could include variables such as age group, education level, urban/rural residence, and wealth quintile. This model estimates how contraceptive use varies across these groups, even if some groups are underrepresented in the sample. Second, researchers use population data to determine how common each combination of characteristics is in the target population, then weight the model predictions accordingly. The result is an estimate that reflects the composition of the national population, not the convenience sample.
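The sketch below illustrates the two steps with simulated data. Everything in it is hypothetical: the covariates, the simulated survey responses, and the population cell shares are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Simulated convenience sample (stands in for the online survey) ---
# Urban, educated, younger respondents are over-represented, as in the example.
n = 5000
sample = pd.DataFrame({
    "age_group": rng.choice(["15-24", "25-34", "35-49"], n, p=[0.5, 0.35, 0.15]),
    "urban":     rng.choice([1, 0], n, p=[0.8, 0.2]),
    "educated":  rng.choice([1, 0], n, p=[0.9, 0.1]),
})
# Simulated outcome: modern contraceptive use depends on the same covariates.
logit = (-0.5 + 0.4 * sample["urban"] + 0.3 * sample["educated"]
         + 0.2 * (sample["age_group"] == "25-34"))
sample["modern_method"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: model the outcome as a function of characteristics available in
# both the survey and the population reference (census or DHS).
X = pd.get_dummies(sample[["age_group", "urban", "educated"]], drop_first=True)
model = LogisticRegression().fit(X, sample["modern_method"])

# Step 2: poststratify. Each row is a population cell with its (illustrative,
# not real) share of the national population of women ages 15-49.
cells = pd.DataFrame([
    ("15-24", 1, 1, 0.06), ("15-24", 1, 0, 0.04),
    ("15-24", 0, 1, 0.10), ("15-24", 0, 0, 0.15),
    ("25-34", 1, 1, 0.05), ("25-34", 1, 0, 0.03),
    ("25-34", 0, 1, 0.09), ("25-34", 0, 0, 0.14),
    ("35-49", 1, 1, 0.04), ("35-49", 1, 0, 0.03),
    ("35-49", 0, 1, 0.08), ("35-49", 0, 0, 0.19),
], columns=["age_group", "urban", "educated", "pop_share"])

X_cells = pd.get_dummies(cells[["age_group", "urban", "educated"]], drop_first=True)
X_cells = X_cells.reindex(columns=X.columns, fill_value=0)
cells["predicted_use"] = model.predict_proba(X_cells)[:, 1]

# Population estimate: cell predictions weighted by population composition.
poststratified = np.average(cells["predicted_use"], weights=cells["pop_share"])
print(f"Raw sample estimate:     {sample['modern_method'].mean():.1%}")
print(f"Poststratified estimate: {poststratified:.1%}")
```

A full MRP analysis would swap the plain logistic regression for a multilevel model so that sparsely observed cells borrow strength from similar cells; the poststratification step would be unchanged.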
MRP has been used successfully in highly skewed samples. In a famous example, researchers used survey data from Xbox (the gaming system) users, a sample dominated by young men, to predict US presidential election outcomes, achieving accuracy comparable to traditional polls after poststratification (Wang et al., 2015). The method doesn’t eliminate all bias, but it can dramatically reduce it when the adjustment variables capture the key differences between sample and population.
The critical assumption is that the model includes the variables that matter. If contraceptive use in Nepal depends heavily on characteristics that differ between online respondents and the general population—and those characteristics aren’t captured in the model—then adjustment won’t fully correct the bias. For instance, if women who respond to online surveys differ from others in their exposure to family planning information or their autonomy in reproductive decisions, and these factors aren’t measured, the adjusted estimate will still be biased.
The point is that generalizability from convenience samples is possible, but it requires explicit modeling assumptions that probability sampling does not. The DHS approach—design the sample to be representative—remains the gold standard. MRP and similar methods are valuable when representative sampling isn’t feasible, but they trade design-based inference for model-based inference, with all the assumptions that entails.
When generalizability is limited
Sometimes generalizability is simply not achievable, and the honest response is to acknowledge the limits of inference.
Consider a qualitative study by Bhatt et al. (2021) examining perceptions of family planning services among young people in Nepal. The researchers conducted focus group discussions and in-depth interviews with adolescents and young adults ages 15–24. But all data collection occurred in a single village—Hattimuda, in Morang district.
The study reveals how young people in this community think about family planning: the barriers they perceive, the role of gender dynamics, the gap between what’s taught in schools and what young people actually understand. This is valuable knowledge. But the study sample (recruited from young people in Hattimuda) is not a representative subset of “all Nepali youth” in any formal sense. Young people in one eastern village may or may not share the same perceptions as youth in urban Kathmandu, the Terai plains, or western hill districts.
The authors acknowledge this directly: “Our findings might differ if the sample had been drawn from other parts of the country.” This is an honest statement about the limits of generalizability when your study population cannot be shown to represent your target. The findings are not useless—they provide insight into one context—but extending them to the national level requires assumptions that cannot be verified from the study itself.
A causal example: A family planning trial in Nepal
Now consider a causal question. Suppose researchers conduct a randomized trial evaluating a community health worker intervention to increase modern contraceptive use among married women in Nepal. The trial is conducted in 20 primary health centers across four districts, enrolling 2,000 women who are not currently using modern methods. After 12 months, women in the intervention arm show a 15 percentage point increase in modern method uptake compared to the control arm.
The trial has strong internal validity—randomization ensures that the effect estimate is unbiased for the women enrolled. But Nepal’s Ministry of Health wants to know whether this effect would hold if the intervention were scaled nationally. This is a generalizability question: the trial participants are a subset of Nepali women of reproductive age, but they may not represent that larger population.
And indeed, they probably don’t. Trial participants were recruited from health centers, meaning they were already engaged with the health system. They consented to participate in a research study, suggesting openness to new information. The four districts may differ from others in urbanization, ethnicity, or health infrastructure. Women with the most barriers to contraceptive use—those who never visit health centers, who face stronger family opposition, or who live in remote areas—are systematically underrepresented.
Addressing this requires the same logic as generalizability for descriptive estimates, but applied to causal effects. Researchers would need to: (1) identify plausible effect modifiers based on theory and prior evidence, (2) measure those variables in both the trial sample and the target population (perhaps using DHS data as a reference), and (3) assess whether the distributions differ in ways that would change the expected effect. If they do, statistical methods—including the same weighting and modeling approaches used for transportability—can adjust the trial estimate to better reflect the target population (Dahabreh et al., 2019; Stuart et al., 2011).
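As a rough illustration of what such an adjustment could look like, the sketch below simulates the hypothetical trial and a DHS-style reference sample for the target population, then reweights trial participants by the inverse odds of trial membership so that the trial's covariate mix matches the target. The single effect modifier ("engaged", meaning prior engagement with health services), the effect sizes, and the reference data are all invented; this is one of several estimation strategies described in the works cited above, and it omits clustering, trial design details, and variance estimation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# --- Simulated trial (2,000 women recruited at health centers) ---
n_trial = 2000
trial = pd.DataFrame({
    "engaged": rng.binomial(1, 0.80, n_trial),   # common among trial enrollees
    "arm":     rng.binomial(1, 0.50, n_trial),   # randomized 1:1
})
# Simulated outcome: the intervention raises uptake by 20 points among
# engaged women but only 5 points among others, so engagement modifies the effect.
p_uptake = 0.10 + trial["arm"] * np.where(trial["engaged"] == 1, 0.20, 0.05)
trial["uptake"] = rng.binomial(1, p_uptake)

# --- Reference sample for the target population (e.g., DHS respondents) ---
n_ref = 5000
reference = pd.DataFrame({
    "engaged": rng.binomial(1, 0.40, n_ref),     # much rarer nationally
})

# Step 1: model membership in the trial versus the reference sample
# using the measured effect modifier(s).
combined = pd.concat([
    trial[["engaged"]].assign(in_trial=1),
    reference[["engaged"]].assign(in_trial=0),
])
member = LogisticRegression().fit(combined[["engaged"]], combined["in_trial"])

# Step 2: weight each trial participant by the inverse odds of trial
# membership, up-weighting the kinds of women the trial under-enrolled.
p_trial = member.predict_proba(trial[["engaged"]])[:, 1]
trial["weight"] = (1 - p_trial) / p_trial

def risk_diff(df, w):
    """Weighted risk difference between intervention and control arms."""
    treated = df["arm"] == 1
    p1 = np.average(df.loc[treated, "uptake"], weights=w[treated])
    p0 = np.average(df.loc[~treated, "uptake"], weights=w[~treated])
    return p1 - p0

naive = risk_diff(trial, np.ones(n_trial))
generalized = risk_diff(trial, trial["weight"].to_numpy())
print(f"Trial estimate:               {naive * 100:.1f} percentage points")
print(f"Target-standardized estimate: {generalized * 100:.1f} percentage points")
```

The weighted estimate is only as good as the assumption behind it: that the measured variables capture the effect modification that matters and are measured comparably in both data sources.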
The generalizability concern is familiar from the descriptive examples above, but for causal effects it takes a specific form: what matters is not whether the trial sample differs from the national population on any characteristic, but whether it differs on characteristics that modify the treatment effect. If the intervention works equally well regardless of health-seeking behavior, education, or geographic isolation, then the trial estimate applies nationally even though the sample was unrepresentative. But if the intervention works better among women already engaged with health services—because they’re easier to reach, more receptive to counseling, or face fewer structural barriers—then the 15 percentage point effect may overstate what would happen at scale.
The key distinction from transportability is not the method but the relationship between populations. For generalizability, the trial sample is contained within the target population; for transportability, it is external. But in both cases, the core question is the same: do the populations differ on effect modifiers? And in both cases, adjustment is only possible when those modifiers are measured and the necessary assumptions hold.
8.3 Transportability: Do the Results Apply to a Different Population Entirely?
Generalizability addresses cases where the study sample is contained within the target population. Transportability addresses the harder problem: applying findings to a population the study never sampled from—where the study population is external to the target population.
A ministry of health in Kenya reads a trial conducted in India and wonders whether the intervention would work in their context. A funder reviews evidence from urban clinics and asks whether the findings apply to rural settings that were never included in the study. A policymaker sees impressive results from a research-intensive trial and asks whether they would hold under routine conditions in a different health system.
No improvement to the original study’s sampling design helps answer these questions. The Kenyan ministry isn’t asking whether the Indian trial represented its source population; it is asking whether findings from India apply in Kenya. The study population and target population don’t overlap at all.
The same effect modification logic from earlier applies here, but the problem is more challenging. For generalizability, you’re asking whether effect modifiers are distributed differently in your sample than in the target population it was drawn from. For transportability, you’re asking whether effect modifiers are distributed differently across entirely separate populations—and whether the causal mechanism that produced the effect in one setting would operate the same way in another.
This is why transportability requires reasoning about how an intervention works, not just whether it worked. If you understand the mechanism—and the conditions under which it operates—you can reason about whether those conditions are likely to hold elsewhere.
TRANSPORTABILITY IN PRACTICE: AN HIV TREATMENT TRIAL
To see how transportability works in practice, return to the ACTG 320 example from the chapter opening (Cole and Stuart, 2010).
In the mid-1990s, AIDS Clinical Trials Group Study 320 (ACTG 320) evaluated the effectiveness of combination antiretroviral therapy among adults living with HIV in the United States. The trial enrolled patients with advanced immunosuppression and compared a three-drug regimen with the two-drug therapy that was standard at the time. The results were striking: combination therapy reduced the risk of AIDS or death by roughly half (hazard ratio = 0.51). This was one of the studies that established highly active antiretroviral therapy (HAART) as standard care.
The trial had strong internal validity. Randomization ensured that, among participants, the estimated effect of treatment was unbiased. But the question that emerged a decade later was not about whether the trial was correct—it was about whether the result still applied.
By the mid-2000s, the population of people living with HIV in the United States looked very different from the population enrolled in ACTG 320. Advances in testing meant that many people were diagnosed earlier. Treatment guidelines had evolved. Patients were younger, more racially diverse, and often less immunosuppressed at the time treatment began. Public health officials wanted to know: Would the dramatic benefits observed in ACTG 320 still hold for the broader HIV population a decade later?
This is a transportability problem. The trial population is not a subset of the target population. It represents a different historical moment, with different patient characteristics and clinical context.
Why naïve application fails
One tempting response would be to simply apply the trial’s effect estimate to the contemporary HIV population. But doing so assumes that the treatment effect is the same for everyone—regardless of age, disease severity, or other characteristics that changed over time.
That assumption is unlikely to hold. For antiretroviral therapy, baseline immune status matters. Patients with very low CD4 counts may experience larger absolute benefits from treatment than those who begin therapy earlier in the disease course. Age, sex, and other clinical factors may also shape how patients respond.
In other words, the trial and target populations differed not just in who they included, but in characteristics that plausibly modify the treatment effect. If so, the trial’s effect estimate cannot simply be “transported” to the new population.
The core transportability idea
Cole and Stuart approached this problem by asking a counterfactual question: What would the ACTG 320 trial have shown if it had been conducted in the later US HIV population instead of the original trial population?
Answering this question requires two ingredients.
First, researchers must identify effect modifiers—characteristics that change how well the treatment works. Based on clinical knowledge, these included age and markers of disease severity.
Second, researchers must know how these characteristics are distributed in both populations: among ACTG 320 trial participants, and in the later US HIV population (using surveillance and cohort data).
The populations differed substantially. The trial over-sampled older patients: 91% were age 30 or older, compared to 66% in the target population. The trial also under-sampled Black patients (28% versus 46%) and over-sampled white patients (54% versus 36%).
If the treatment effect differs across levels of these characteristics, and if the distributions differ between populations, then the original trial estimate will not apply directly.
Effect modification in action
So did the treatment effect vary? When the researchers examined age-stratified results, they found striking heterogeneity:
| Age group | Hazard ratio |
|---|---|
| 13–29 | 1.87 |
| 30–39 | 0.21 |
| 40–49 | 0.84 |
| 50+ | 0.59 |
Among patients in their 30s, treatment was dramatically protective (HR = 0.21, roughly 80% risk reduction). Among the youngest patients, treatment appeared to increase risk—though this estimate was imprecise given the small subgroup. The overall trial estimate of 0.51 masked substantial variation.
The age-stratified estimates have wide confidence intervals because each subgroup is small. The point is not that we know these stratum-specific effects precisely, but that there is clear evidence of heterogeneity—the treatment effect is not constant across age groups.
Now the population differences matter. The trial was dominated by patients in their 30s and 40s—the groups where treatment worked best. The target population was younger, with over a third under age 30.
Re-expressing the trial effect
Rather than discarding the original trial results, Cole and Stuart showed how the trial data could be reweighted so that the distribution of key effect modifiers matched that of the target population. Conceptually, this means giving more weight to trial participants who resemble people in the later HIV population, and less weight to those who do not.
The goal was not to “fix” the trial or make it representative in a sampling sense. Instead, the goal was to re-express the causal effect under the conditions that define the target population.
When this adjustment was applied, the estimated treatment effect changed:
- Trial estimate (intent-to-treat): HR = 0.51
- Transported estimate (weighted to target population): HR = 0.57 (closer to the null value of 1, indicating a smaller benefit)
The estimated risk reduction fell from 49% to 43%, an attenuation of roughly 12% in relative terms. The benefit of combination therapy was still substantial, but smaller. The difference arose not because the trial was biased, but because the population to which the result was being applied had changed in ways that mattered for treatment effectiveness.
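A back-of-the-envelope standardization shows why the estimate moves in this direction. The sketch below averages the age-stratified log hazard ratios from the table above over two stratum distributions. The stratum shares are illustrative guesses, not the published figures, and averaging hazard ratios this way is only a crude stand-in for the individual-level reweighting Cole and Stuart performed, so it will not reproduce the 0.57 exactly. It does show the mechanism: shifting weight toward the youngest stratum, where the estimated hazard ratio is above 1, pulls the summary toward the null.

```python
import numpy as np

# Age-stratified hazard ratios from the table above.
age_groups = ["13-29", "30-39", "40-49", "50+"]
hr = np.array([1.87, 0.21, 0.84, 0.59])

# Illustrative age distributions (NOT the published figures): the trial is
# dominated by patients in their 30s and 40s; the target population is
# younger, with roughly a third under age 30.
trial_share  = np.array([0.09, 0.46, 0.33, 0.12])
target_share = np.array([0.34, 0.33, 0.22, 0.11])

def standardized_hr(log_hr, shares):
    """Average stratum-specific log hazard ratios over a stratum distribution."""
    return np.exp(np.sum(shares * log_hr))

log_hr = np.log(hr)
print(f"Standardized to trial-like ages:  HR = {standardized_hr(log_hr, trial_share):.2f}")
print(f"Standardized to target-like ages: HR = {standardized_hr(log_hr, target_share):.2f}")
```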
What this example teaches
This HIV example illustrates several core ideas about transportability.
Lack of representativeness is not a flaw. ACTG 320 was not designed to represent all people living with HIV in the future, and it did not need to be. Its internal validity was strong.
Differences between populations only matter when they modify the effect. Many characteristics differed between the trial and target populations, but only those that changed how treatment worked were relevant for transportability.
Transportability is not automatic. It requires explicit assumptions about which features of the population matter, and it requires data on those features in both populations. When key effect modifiers are unmeasured—as Cole and Stuart note for CD4 count in the target population—transportability analysis is limited, not because of a statistical failure, but because of missing information.
Transportability methods address only one source of external validity bias: differences in the composition of the study and target populations when treatment effects are heterogeneous. They do not address changes in the treatment itself, differences in outcome definitions, or shifts in health system context. Those issues require substantive judgment, not statistical adjustment.
Two traditions, one insight
Transportability methods emerged from two different traditions in causal inference.
One tradition uses causal diagrams—directed acyclic graphs (DAGs)—to represent how variables relate to each other. Within this framework, researchers formalize transportability by examining the causal structure: which variables differ between populations, and whether those differences block or modify the causal pathway from treatment to outcome. This approach uses the structure of the diagram itself to determine whether—and under what conditions—a causal effect can be transported.
The other tradition works within the potential outcomes framework, focusing on what would happen to the same individual under different treatments. Here, transportability is approached through statistical estimation: weighting trial participants to resemble the target population, or modeling how effects vary across characteristics and projecting those models onto new populations. This tradition has produced the practical estimation methods—propensity score weighting, outcome regression, and their combinations—that researchers apply in practice.
Despite their different mathematical machinery, the two traditions converge on the same fundamental insight: what matters for transportability is whether populations differ on characteristics that modify the treatment effect. The graphical approach helps you reason about when transport is possible; the potential outcomes approach provides tools for how to do it.
FROM EXAMPLE TO GENERAL PRINCIPLE
The ACTG 320 example illustrates a broader point about how to think through any transportability question. Before assuming that evidence from one population applies to another, ask:
On what characteristics do the populations differ? This requires data on both the study population and the target population—ideally measured the same way.
Which of those characteristics might modify the treatment effect? This requires reasoning about mechanisms. Why might the treatment work differently across levels of a given characteristic?
How large are the differences on those effect modifiers? Small differences on strong effect modifiers, or large differences on weak effect modifiers, may not matter much. Large differences on strong effect modifiers will.
Can we adjust for the differences? If the relevant effect modifiers are measured in both populations, statistical methods can re-weight or model the evidence to produce a transported estimate. If they aren’t measured, adjustment isn’t possible.
These questions won’t always have clean answers. Often we don’t know which characteristics modify effects, or we can’t measure them in both populations, or we’re uncertain about mechanisms. But asking the questions explicitly is better than assuming evidence transports unchanged—or assuming it doesn’t transport at all.
WHY THIS MATTERS FOR GLOBAL HEALTH
In global health, transportability is not an edge case—it is the norm. Evidence is routinely generated in one setting and applied in another: across countries, health systems, and historical periods.
The ACTG 320 example shows that transportability is not about deciding whether a study “applies” in some vague sense. It is about making explicit the assumptions required to extend a causal claim from one population to another—and about recognizing when those assumptions are plausible, questionable, or untenable.
Transportability methods offer one way to formalize this reasoning when the necessary data are available. But even when formal adjustment isn’t possible, the underlying logic remains valuable: identify what differs, reason about whether those differences modify effects, and be honest about what you don’t know.
8.4 Designing for External Validity
The methods we’ve discussed address external validity after the fact: given a completed study, can we transport its findings to a new population? But there’s a prior question worth asking: How do we generate evidence that’s transportable in the first place?
Transportability isn’t just an analytical problem to solve after data collection ends. It’s a design problem that begins before the first participant is enrolled. The choices researchers make—about who participates, what gets measured, and what data exist on target populations—determine whether transportability analysis is even possible later.
MEASURING WHAT MATTERS
Transportability analysis requires knowing how effect modifiers are distributed in both the study population and the target population. This creates a simple but demanding requirement: you must measure the right things in both places.
If you suspect that age modifies a treatment’s effect but don’t collect age data in your trial, you can’t later assess whether age differences between populations explain divergent results. If a ministry of health wants to know whether trial evidence applies to their population but no one has collected data on that population’s characteristics, there’s nothing to transport to.
This is a design imperative, not an afterthought. Before launching a study, researchers should ask: What characteristics might plausibly modify how this treatment works? Are we measuring them? Will comparable measurements exist in the populations we hope to inform?
The answers won’t always be clear. Effect modification is often discovered rather than predicted. But the question itself—what might matter, and are we measuring it?—focuses attention on external validity from the start rather than treating it as a limitation to acknowledge in the discussion section.
ENROLLING FOR HETEROGENEITY
Clinical trials have historically enrolled narrow populations. Women were excluded from many trials until the 1990s, partly due to concerns about pregnancy but also due to assumptions that findings in men would apply to women. Older adults are routinely excluded despite being the primary users of many treatments. Racial and ethnic minorities remain underrepresented across therapeutic areas.
These exclusions are often framed as equity concerns—and they are. But the problem is also scientific. Homogeneous trials cannot reveal effect heterogeneity. If everyone in your study is young, male, and healthy, you cannot learn whether age, sex, or comorbidities modify the treatment effect. You cannot identify effect modifiers you haven’t studied. You cannot assess transportability to populations you’ve excluded.
Consider what this means for the ACTG 320 example. The researchers could examine whether treatment effects varied by age only because the trial enrolled patients across age groups. Had the trial enrolled only patients in their 30s—the group where treatment worked best—the overall estimate would have looked even more impressive, but it would have been far less informative about what to expect in other populations.
Diversity in trials isn’t a box to check for regulatory approval. It’s the raw material for understanding effect heterogeneity. Inclusive enrollment produces evidence that reveals how treatments work across groups—evidence that supports rather than undermines transportability.
REAL-WORLD EVIDENCE AND TARGET POPULATIONS
Real-world evidence (RWE)—data from electronic health records, insurance claims, disease registries, and other routine sources—plays a specific role in transportability: it describes the target populations we want to reach.
When Cole and Stuart transported the ACTG 320 trial results to the 2006 US HIV population, they needed to know the age, sex, and race distribution of that target. CDC surveillance data provided it. Without data on the target population, transportability analysis wouldn’t have been possible. The statistical methods are useless without something to transport to.
This is why data infrastructure matters for external validity. Countries and health systems that invest in routine data collection—surveillance systems, registries, linked administrative databases—create the foundation for transportability analysis. Those without such infrastructure face a harder problem: they may have evidence from elsewhere but no way to formally assess whether it applies locally.
RWE serves another function as well. Trials establish efficacy under controlled conditions: selected patients, close monitoring, protocol-driven care. Real-world data reveal whether effects persist when treatments reach broader populations—patients with multiple comorbidities, those who miss appointments, people who would never have enrolled in a trial. This isn’t replacing randomization with observation. It’s asking a different question: Does the effect hold outside the conditions that produced it?
Regulatory agencies increasingly recognize this. The US FDA and European Medicines Agency now incorporate real-world evidence into certain decisions, particularly for understanding how treatments perform in populations underrepresented in trials. The goal isn’t to lower the bar for causal inference—it’s to raise the bar for external validity.
8.5 Anticipating Threats: A Design-Stage Checklist
The methods we’ve discussed—weighting, standardization, transportability analysis—are treatments for external validity problems that have already occurred. But as in medicine, prevention is often better than treatment. The time to think about external validity is before the first participant is enrolled, not after the data are collected.
A classic framework from Shadish, Cook, and Campbell (2002) identifies five sources of external validity threat. Each represents a way that causal relationships might fail to hold across variations in study conditions. Framed as questions, they become a design-stage checklist.
WOULD THE EFFECT HOLD FOR DIFFERENT PEOPLE?
This threat—what Shadish and colleagues call interaction of the causal relationship with units—asks whether findings would differ if different kinds of participants had been studied.
Clinical trials routinely exclude patients with comorbidities, limited literacy, or unstable living situations. The intervention may work beautifully among healthy, motivated volunteers but fail among sicker patients who face competing demands. In the ACTG 320 example, treatment effects varied substantially by age—information only available because the trial enrolled across age groups.
At the design stage, ask: Who are we excluding, and why? Are those exclusions scientifically necessary, or merely convenient? What characteristics of participants might plausibly modify the treatment effect? Are we enrolling enough diversity to detect such heterogeneity?
WOULD THE EFFECT HOLD FOR TREATMENT VARIATIONS?
This threat—interaction over treatment variations—asks whether findings depend on specific features of how the intervention was delivered.
A community health worker program might work when health workers are well-trained, supervised weekly, and paid reliably. It may fail when implemented at scale with minimal training, infrequent supervision, and delayed payment. A drug tested in combination with standard care may behave differently when combined with other medications patients actually take.
At the design stage, ask: How might this intervention be delivered differently in other contexts? Are we testing a specific, replicable protocol or an idealized version that won’t survive contact with real health systems? What are the “active ingredients,” and what might dilute them?
WOULD THE EFFECT HOLD FOR DIFFERENT OUTCOMES?
This threat—interaction with outcomes—asks whether findings depend on how the outcome was measured.
A training program might improve knowledge scores on a written test but not actual behavior change. A treatment might reduce symptoms as measured by clinician rating but not by patient self-report. An intervention might improve a biomarker but not the clinical endpoint patients actually care about.
At the design stage, ask: Are we measuring what ultimately matters, or a convenient proxy? Would stakeholders accept this outcome as meaningful? If the effect holds for this outcome, would we expect it to hold for related outcomes that weren’t measured?
WOULD THE EFFECT HOLD IN DIFFERENT SETTINGS?
This threat—interaction with settings—asks whether findings depend on features of where the study was conducted.
An intervention tested in a well-resourced academic medical center may not work in an understaffed rural clinic. A program that succeeds when researchers provide intensive support may falter under routine implementation. A finding from a health system with universal insurance coverage may not apply where patients pay out of pocket.
At the design stage, ask: What features of our study setting might not exist elsewhere? Are we studying conditions that are unusually favorable (or unfavorable)? What would need to be true about a setting for our findings to apply there?
WOULD THE SAME MECHANISM OPERATE ELSEWHERE?
This final threat—context-dependent mediation—asks whether the reason an intervention works might differ across contexts, even if the overall effect appears similar.
A family planning counseling program might increase contraceptive uptake in one setting by improving knowledge, but in another setting by reducing stigma around seeking services. If the mechanism differs, then features that support one pathway (information materials) may be less relevant than features that support the other (community engagement). The “same” intervention may require different implementation strategies in different contexts.
At the design stage, ask: Do we understand why this intervention should work? What mechanisms are plausible, and might those mechanisms operate differently elsewhere? Are we measuring mediators that would help us understand variation in effects across contexts?
USING THE CHECKLIST
These five questions won’t guarantee external validity. Many threats only become apparent after a study is complete—or after findings fail to replicate in new settings. But asking these questions early serves several purposes.
First, it identifies design choices that might expand or limit transportability. Broader eligibility criteria, more diverse settings, multiple outcome measures, and assessment of mediators all preserve options for later analysis and interpretation.
Second, it encourages realistic expectations. A tightly controlled efficacy trial in ideal conditions tells you what’s possible; it doesn’t promise what will happen at scale. Acknowledging this upfront is more honest than discovering it later.
Third, it shifts external validity from an afterthought to an explicit design consideration. The goal isn’t to make every study maximally generalizable—that’s neither possible nor always desirable. The goal is to be intentional about scope, so that the boundaries of what was learned are clear.
8.6 Closing Reflection
Return to ACTG 320. A treatment that reduced the risk of AIDS or death by 49% in the trial would have reduced it by only 43% in the broader population. Same drug, same outcome, different answer.
That gap isn’t a failure of the trial. The researchers got the right answer for the people they studied. The gap reflects something more fundamental: causal effects have scope. They hold for particular populations, under particular conditions. Extending them beyond that scope requires assumptions—about which characteristics modify the effect, about how populations differ, about what stayed constant and what changed.
This chapter has given you a framework for reasoning about that scope. Effect modification is the key insight: population differences only matter when they involve characteristics that change how treatments work. Generalizability and transportability name two different versions of the problem, depending on whether your sample is contained within the target population or external to it. Statistical methods—weighting, standardization, modeling—can adjust for differences when you’ve measured the right variables, but they formalize assumptions rather than eliminate them. And the design-stage checklist reminds you that external validity thinking should begin before the first participant is enrolled, not after data collection ends.
In global health, these concerns are not edge cases—they are the default condition. Evidence is routinely generated in one place and applied in another. Trials from high-income countries inform guidelines implemented in low-resource settings. Studies from urban academic centers shape policy for rural health posts thousands of kilometers away. When evidence is transported without attention to effect modification, the populations most different from trial participants may receive interventions that work less well for them—or don’t work at all.
The goal isn’t to become paralyzed by uncertainty. It’s to ask better questions: For whom was this shown? How might my population differ? What would have to be true for this evidence to apply here?
Evidence is born in specific places, but decisions must live everywhere. Your job is to reason carefully about the path between the two.