12 Observational Designs

On June 5, 1981, the CDC’s Morbidity and Mortality Weekly Report (MMWR) published a brief notice about five young men in Los Angeles, all previously healthy, all hospitalized with Pneumocystis carinii pneumonia (PCP) (Gottlieb et al., 1981). Two of them had already died. The lead author, an immunologist at UCLA named Michael Gottlieb, had no idea what was killing them, but he knew the pattern was wrong. PCP was a disease of profoundly immunocompromised patients, not men in their twenties and thirties.

Observational research means studying the world as it is, without manipulating who gets exposed to what. You observe patterns, document associations, and—if you’re careful—make inferences about causation.

A month later, MMWR published a second report: Kaposi’s sarcoma (KS), normally a rare cancer in the United States, had been diagnosed in 26 homosexual men, 20 in New York City and six in California, over the previous 30 months (Friedman-Kien et al., 1981). By September, The Lancet had described eight more cases of KS in young homosexual men in New York (Hymes et al., 1981). Neither the virus nor a name for the syndrome existed yet. But the alarm had been sounded, not by a randomized trial, but by physicians paying attention and documenting what they observed.

The case reports were only the start. CDC investigators then completed a case-control study comparing men with Kaposi’s sarcoma or PCP to matched controls, and they could already point to specific sexual practices and partner counts that distinguished cases from controls (Jaffe et al., 1983). They couldn’t name the agent, but they could say it was sexually transmitted. That conclusion shaped the public health response and helped focus the search for a viral cause.

Cohort studies took over once an antibody test became available. A six-year follow-up of homosexual men in San Francisco mapped seroconversion rates, latency periods, and progression to AIDS, defining the natural history of an infection that had been unknown four years earlier (Jaffe et al., 1985). By the time the first randomized trial of zidovudine reached publication, observational research had already established that HIV caused AIDS, identified how it spread, and laid out who was most affected. The RCT could test a treatment because the science underneath it was already there.

Most global health research works this way. If you’ve been taught that randomized controlled trials sit at the top of the evidence hierarchy, you might see observational studies as what we settle for when we can’t randomize. That undersells them. Observational designs are often the right tool for the question—sometimes the only tool that’s ethical, feasible, or fast enough to matter.

And despite rumors to the contrary, causal inference is very much possible with observational data. In Chapter 7, we introduced two strategies for causal inference from non-experimental data (including data generated in observational studies): confounder control (closing backdoor paths through statistical adjustment) and design-based approaches (isolating exogenous variation). The quasi-experimental designs in Chapter 11 focus on the latter. This chapter is about the former—and about the full range of observational designs, from case reports that first raise the alarm to cohort studies that follow thousands of people over years.

12.1 Descriptive Designs: Noticing What Others Miss

CASE REPORTS AND CASE SERIES: THE CANARIES IN THE COAL MINE

In early 2015, physicians in Pernambuco State, Brazil, noticed something alarming. Women were giving birth to babies with unusually small heads—a condition called microcephaly. In previous years, the state had recorded about 6 cases in the early months. By December 2015, over 900 cases were reported in a single week (Teixeira et al., 2016).

This wasn’t a formal study with a protocol and power calculations. It was doctors paying attention and recognizing a pattern that shouldn’t exist. These observations—documented as case reports and case series—triggered Brazil’s national public health emergency response and eventually connected the dots to Zika virus transmission (Teixeira et al., 2016).

Case reports describe individual patients or events. Case series describe a group of patients with a shared characteristic (like, say, babies born with microcephaly in the same geographic region and time period). They’re the simplest form of observational research, and they’re valuable for exactly one thing: noticing unusual patterns and generating hypotheses.

Case reports are detailed descriptions of individual cases—often unusual presentations or rare conditions. Case series aggregate multiple similar cases to identify patterns.

They alert the medical community to new diseases, identify rare adverse effects of treatments, and describe the clinical features of novel conditions. The Zika case series generated the hypothesis that the virus caused microcephaly. But case reports cannot tell you how common something is (they have no denominator), cannot prove causation (they have no comparison group), and cannot rule out confounding. For all of that, you need different designs.

The Zika-microcephaly connection shows both the power and limits of descriptive epidemiology. The case series prompted urgent investigation, but researchers still needed to estimate the actual risk. Using surveillance data on live births and microcephaly notifications, one study estimated that the risk of microcephaly after Zika infection ranged from 0.03% to 3.42% depending on the region (Jaenisch et al., 2017).

Let’s pause on that range: 0.03% to 3.42% (Jaenisch et al., 2017). That’s more than a 100-fold difference. At the low end, roughly 1 in 3,300 Zika-exposed pregnancies would result in microcephaly. At the high end, roughly 1 in 29. If you’re a pregnant woman exposed to Zika, that range isn’t particularly reassuring—you need to know which end of the spectrum applies to you.
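The arithmetic behind those "1 in N" figures is simple enough to check directly. The short sketch below converts the two published percentages into "1 in N" form and computes the fold difference; the helper name `one_in_n` is introduced here only for illustration.

```python
def one_in_n(risk_percent: float) -> float:
    """Convert a risk expressed in percent to the N in '1 in N'."""
    return 100.0 / risk_percent

# The two ends of the range quoted in the text, in percent.
low, high = 0.03, 3.42

print(f"Low end:  1 in {one_in_n(low):,.0f}")    # 1 in 3,333
print(f"High end: 1 in {one_in_n(high):,.0f}")   # 1 in 29
print(f"Fold difference: {high / low:.0f}x")     # 114x
```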

The geographic variation itself became a crucial clue. Why would risk vary so dramatically across Brazilian states? Several hypotheses emerged: differences in Zika virus strains, co-infections with other viruses, environmental cofactors, or even differences in case ascertainment and surveillance intensity. Each possibility required different analytical designs to investigate properly—exactly the kind of hypothesis that case series generate but cannot test.

This uncertainty spurred emergency responses well before precise estimates were available. Brazil developed laboratory and imaging protocols for suspected cases (Jaenisch et al., 2017). Public health authorities declared a national emergency. Pregnant women received updated guidance about mosquito protection and travel restrictions. None of these responses waited for definitive causal proof—the descriptive evidence of an alarming spike in cases was enough to mobilize resources, alert clinicians, and begin investigating potential exposures.

The initial case series didn’t need to be perfect; it needed to be timely and credible enough to sound the alarm. That’s the fundamental value proposition of descriptive epidemiology during emerging disease outbreaks: rapid detection trumps precise estimation when time matters.

Stop and Ask Yourself

Why can’t we calculate risk from a case series alone? Think about what you’d need to know to turn the number of microcephaly cases into a risk estimate.

The answer, of course, is that you need the denominator: how many Zika-exposed pregnancies didn’t result in microcephaly? Case series give you the numerator (bad outcomes) but not the denominator (total exposed population). You need different designs for that.

PUBLIC HEALTH SURVEILLANCE: WATCHING POPULATIONS

If case reports are spotting individual canaries, public health surveillance is monitoring the entire coal mine—systematically and continuously.

Surveillance systems track disease occurrence in populations over time, often using standardized case definitions and reporting procedures. They’re not research studies in the traditional sense (no hypothesis, no statistical analysis plan), but they generate the data that makes population-level research possible.

During the Ebola crisis, contact tracing was initiated for 26.7% of total cases, detecting 3.6% of new cases through this surveillance mechanism (Swanson et al., 2018). Contact monitoring succeeded 88% of the time, though this varied substantially between urban and rural districts. These numbers might seem modest, but they improved dramatically over time as teams learned what worked—insights possible only through systematic surveillance documentation.

There are two main types of surveillance. Passive surveillance relies on routine reporting by health facilities. Active surveillance involves deliberate case-finding, such as the door-to-door fever screening used during Ebola outbreaks.

The National Polio Surveillance Project in Uttar Pradesh, India, provides another example. The system tracked acute flaccid paralysis cases alongside vaccination coverage to monitor wild poliovirus transmission (Coates et al., 2013). This wasn’t a one-time study. It was ongoing surveillance integrated into program implementation, ultimately contributing to India’s success in eliminating wild polio.

Across sub-Saharan Africa, the Integrated Disease Surveillance and Response (IDSR) strategy illustrates how surveillance systems can be scaled across resource-limited settings (Alhassan et al., 2024). IDSR engages community health workers as frontline disease detectors, emphasizing stakeholder engagement and robust data exchange between local and national levels. This decentralized approach recognizes that effective surveillance depends on trust and participation from communities—not just sophisticated technology or centralized control.

The genius of IDSR lies in its multi-level architecture. Community health workers—who may have limited formal education but deep local knowledge—serve as the eyes and ears of the system. They recognize unusual patterns (more diarrhea cases than normal, children dying of fever, animals behaving strangely) and report upward. District health offices aggregate these reports, verify suspected outbreaks, and coordinate rapid response teams. National epidemiology centers analyze trends, allocate resources, and communicate back down to communities (Alhassan et al., 2024).

This isn’t just data flowing up a hierarchy. It’s a bidirectional system where community concerns drive investigation and national findings inform local action. When communities see that their reports lead to actual responses—vaccine campaigns, clean water interventions, mosquito control—they remain engaged. When they don’t see responses, the system degrades.

The key enabler is stakeholder engagement at every level: health workers need training and supervision, district officers need investigation protocols and rapid response capacity, national centers need analytic tools and communication channels, and communities need transparent feedback about what’s being done with the data they provide. Miss any of these elements and the system fails.

IDSR priorities include epidemic-prone diseases (measles, cholera, meningitis, yellow fever), diseases targeted for eradication (polio, guinea worm), and diseases of public health importance (malaria, TB, HIV).

What distinguishes surveillance from other observational designs is that it’s continuous (always on, not a one-time study), population-based (covering defined geographic areas), standardized (using consistent case definitions over time), and action-oriented (designed to trigger public health responses, not just produce publications). But surveillance shares the same fundamental limitation as case series: it describes what’s happening, but it can’t tell you why. For that, you need analytical designs.

FROM DESCRIPTION TO ANALYSIS: A WORKED EXAMPLE

To see how surveillance data can transition from describing a problem to testing hypotheses about its causes, trace a hypothetical outbreak.

Suppose a district health office in rural Tanzania notices an uptick in bloody diarrhea cases reported by community health workers over a two-week period, and the IDSR system flags it as exceeding the expected threshold for dysentery. A rapid response team investigates and documents, say, 47 cases across three villages. The case series might show that 85% of cases are children under five, that cases cluster in households along one side of a river, that most affected children drank water from a specific community well, and that onset dates suggest the outbreak started about ten days ago.
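What "exceeding the expected threshold" means in practice depends on the disease and on national guidelines, so the following is only a sketch of one common approach: flag a reporting period when its count exceeds the historical mean plus two standard deviations. The historical counts, the multiplier `k`, and the function name are all assumptions made for illustration; only the 47 cases come from the hypothetical scenario above.

```python
from statistics import mean, stdev

def exceeds_threshold(history, current, k=2.0):
    """Return (flagged, threshold), where threshold = mean + k * SD of past counts.

    This rule is an illustrative assumption, not an official IDSR threshold.
    """
    threshold = mean(history) + k * stdev(history)
    return current > threshold, threshold

# Hypothetical counts of bloody diarrhea from eight earlier two-week periods.
past_counts = [4, 6, 5, 7, 3, 5, 6, 4]

flagged, threshold = exceeds_threshold(past_counts, current=47)
print(f"Threshold: {threshold:.1f} cases; flagged: {flagged}")
```

A rule like this is deliberately crude. It trades some false alarms for speed, which fits the action-oriented purpose of surveillance.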

The geographic clustering and water source association generate testable hypotheses. The community well might be contaminated with Shigella. Contamination might have resulted from flooding two weeks earlier that damaged the well structure. Households on the affected side of the river might have less access to improved sanitation, making contamination more likely to spread. Each hypothesis points to a different analytical design. A case-control study could match affected children to unaffected children in the same villages, comparing water sources, sanitation access, and hygiene practices. An environmental assessment could test the suspect well and other water sources for bacterial contamination, inspect well structure for damage, and map sanitation facilities relative to water sources. A cross-sectional survey could assess diarrhea prevalence, water sources, and sanitation access across all households in the affected villages.

Suppose the team does all three. The case-control study establishes that drinking from the suspect well dramatically increased dysentery odds. Environmental testing confirms Shigella in the well. The cross-sectional survey shows that households without latrines cluster near the contaminated well. The well is decommissioned, households receive water purification tablets, a latrine construction program begins, and the IDSR system continues monitoring to confirm the outbreak has ended.

This progression—from surveillance noticing unusual cases, through hypothesis generation, through analytical epidemiology testing specific causal claims, to public health action—shows how good surveillance systems function as the foundation for rapid epidemiological investigation that blends descriptive and analytical methods in real time.

12.2 Cross-Sectional Studies: Snapshots of Health and Disease

THE ONE-TIME PHOTOGRAPH

Imagine you walk into a village and measure everyone’s height, weight, blood pressure, and malaria status on the same day. That’s a cross-sectional study: exposure and outcome measured at the same point in time.

Cross-sectional means “cutting across” the population at one time point, like taking a photograph. Everyone is assessed simultaneously (or within a short time window).

Cross-sectional studies are workhorses in global health because they’re relatively quick and cheap. You don’t need to follow anyone over time. You measure what you measure, analyze it, and you’re done.

A single wave of a national survey can reveal stark regional inequalities. A multilevel analysis of the 2016 Ethiopian Demographic and Health Survey found that fewer than one in five mothers (17.6%, 95% CI 15.6–19.6) in Ethiopia’s pastoral “emerging regions” practiced exclusive breastfeeding, with significant variation by region (Gebremedhin et al., 2021). These geographic differences—captured efficiently in a single survey round—provide exactly the kind of actionable intelligence needed to target interventions where they’ll have the greatest impact.

Cross-sectional studies excel at estimating prevalence (what percentage of the population has the condition right now?), identifying associations (are certain factors more common among people with the disease?), assessing population health status for planning purposes, and generating hypotheses worth investigating with stronger designs. But there’s a fundamental problem with cross-sectional studies for causal inference, and it’s all about timing.

THE CHICKEN-AND-EGG PROBLEM

Remember from the causal inference chapter that temporality is essential for causation? The cause has to come before the effect. But in cross-sectional studies, you’re measuring everything at once. You don’t know which came first.

Let’s say you find that people with depression are less physically active. Plausible explanation: depression reduces motivation to exercise. Also plausible: lack of exercise leads to depression. Also plausible: some third factor (chronic illness? socioeconomic stress?) causes both.

Without knowing the temporal sequence—which came first, the depression or the inactivity—you cannot make causal claims. You can only describe associations.

This is why cross-sectional studies often report prevalence rather than incidence. Prevalence is the proportion of people who have the condition at one time point. Incidence is the rate at which new cases occur over time—and for that, you need to follow people longitudinally.

PANEL STUDIES AND REPEATED CROSS-SECTIONS

There are clever variations on the basic cross-sectional design that add temporal dimension without full longitudinal follow-up.

Repeated cross-sections survey different people from the same population at multiple time points. The Demographic and Health Surveys (DHS) work this way. You can track population-level trends (is child mortality decreasing in Ethiopia?) even though you’re not following the same children over time.

A study using Ethiopian DHS data over two decades explored seasonal patterns in births, finding that the median birth interval was notably longer in urban areas (38.6 months) than in rural areas (35.1 months) (Bezabih et al., 2025). This kind of temporal and geographic variation can inform resource allocation without the expense of following specific women for years.

Panel studies or longitudinal surveys follow the same individuals over time, measuring them repeatedly. These start to blur the line between cross-sectional and cohort designs. They let you see how individual-level factors change and relate to outcomes, making causal inference more plausible (though still not as clean as in prospective cohorts, as we’ll see).

Prevalence = proportion with the condition now
Incidence = rate of new cases over time

You need follow-up to measure incidence!

PREVALENCE VS. INCIDENCE: WHY THE DISTINCTION MATTERS

Let’s pause here to clarify a fundamental concept that trips up many students: the difference between prevalence and incidence.

Prevalence is the proportion of people who have the condition at one point in time. If you survey 1,000 adults and 150 have hypertension, the prevalence is 15%. It’s a snapshot.

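Incidence captures how quickly new cases develop over a specified period. If you follow 850 adults without hypertension for one year and 85 develop it, the one-year cumulative incidence (risk) is 85/850 = 10%. Expressed as an incidence rate, that is roughly 10 cases per 100 person-years, and slightly more once you account for the fact that people stop contributing time at risk when they become cases.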
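The 850-person example can be worked through to separate two quantities that are easy to blur: cumulative incidence (a proportion) and the incidence rate (cases per unit of person-time). The sketch below assumes, purely for illustration, that new cases occur on average halfway through the year, so each case contributes half a person-year at risk.

```python
at_risk = 850          # adults free of hypertension at baseline
new_cases = 85         # developed hypertension during follow-up
follow_up_years = 1.0

# Cumulative incidence: proportion of the at-risk group who became cases.
cumulative_incidence = new_cases / at_risk

# Person-time at risk: non-cases contribute the full year; cases are
# assumed (for illustration) to contribute half a year each.
person_years = (at_risk - new_cases) * follow_up_years \
    + new_cases * follow_up_years / 2

# Incidence rate: cases per person-year at risk.
incidence_rate = new_cases / person_years

print(f"Cumulative incidence: {cumulative_incidence:.1%}")             # 10.0%
print(f"Person-years at risk: {person_years}")                         # 807.5
print(f"Incidence rate: {incidence_rate * 100:.1f} per 100 person-years")  # 10.5
```

Over a single year the two numbers are close. They diverge more as follow-up lengthens or as the outcome becomes more common.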

Here’s why this matters: prevalence depends on both incidence and duration. Conditions that arise relatively rarely but last a long time (like diabetes) have high prevalence relative to incidence. Conditions that arise often but resolve quickly (like common colds) have low prevalence relative to incidence.
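The link between prevalence, incidence, and duration can be made concrete. In a stable population at steady state, the prevalence odds equal the incidence rate multiplied by the mean duration of the condition. The sketch below applies that relationship to two made-up conditions; the incidence rates and durations are hypothetical values chosen only to show the contrast.

```python
def steady_state_prevalence(incidence_rate: float, mean_duration: float) -> float:
    """Prevalence implied by prevalence odds = incidence rate * mean duration.

    Both arguments must use the same time unit (here, years). Assumes a
    stable population at steady state.
    """
    product = incidence_rate * mean_duration
    return product / (1 + product)

# Hypothetical chronic condition: 5 new cases per 1,000 person-years, lasting 20 years.
chronic = steady_state_prevalence(incidence_rate=0.005, mean_duration=20)

# Hypothetical acute condition: 2 episodes per person-year, lasting 7 days each.
acute = steady_state_prevalence(incidence_rate=2.0, mean_duration=7 / 365)

print(f"Chronic condition prevalence: {chronic:.1%}")
print(f"Acute condition prevalence:   {acute:.1%}")
```

Under these assumptions the acute condition has an incidence 400 times higher than the chronic one, yet its prevalence at any given moment is lower, because each episode is so short.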

Cross-sectional studies naturally measure prevalence. To measure incidence, you need longitudinal follow-up, which brings us to cohort studies (covered in Section 12.4).

Stop and Ask Yourself

You conduct a survey and find that 30% of adults have chronic lower back pain. Six months later, you survey a different sample from the same population and find 32% have back pain.

Can you conclude that back pain incidence increased? Not quite. Prevalence changed, but you don’t know if this reflects more people developing back pain (higher incidence) or fewer people recovering (longer duration). To measure incidence, you’d need to follow pain-free individuals and track new cases over time.

WHEN TO CONSIDER THIS DESIGN

Cross-sectional studies are the natural choice when prevalence estimation is the goal—you want to know how common a condition is in a population right now. They’re also the right tool for baseline or needs assessments before launching an intervention: what’s the current vaccination coverage, nutrition status, or healthcare utilization pattern? When resources and time are limited, the ability to collect data in weeks rather than years makes cross-sectional designs feasible when nothing else is. And repeated cross-sectional surveys like the DHS track population health trends over time without following specific individuals, making them powerful tools for monitoring progress.

HOW TO STRENGTHEN THE DESIGN

The quality of a cross-sectional study depends heavily on decisions made before data collection begins. Use validated measurement instruments with known reliability—this reduces measurement error and allows comparison across settings. Achieve high response rates, because non-response bias is a serious threat; document who refused and compare them to participants if possible. Sample representatively using probability sampling when feasible, and be explicit about limitations when convenience sampling is unavoidable.

Even though cross-sectional designs can’t establish causation, you can strengthen causal plausibility by collecting detailed information about exposure timing—ask when exposures began, not just whether they’re present. Triangulate findings with other data sources (administrative data, qualitative interviews, repeated surveys) and pre-specify your analysis plan to avoid data dredging.

Beware of survival bias in cross-sectional studies. You only capture people alive and accessible at the survey time. Severe cases who died are missing, biasing your sample toward milder cases.

STRENGTHS AND LIMITATIONS

Cross-sectional studies are fast, relatively inexpensive, and can measure many variables simultaneously. With proper sampling, they provide representative snapshots of population health and generate hypotheses for further investigation. They are the workhorses of descriptive epidemiology.

But they cannot establish temporal sequence—you don’t know whether the exposure came before the outcome—which makes them vulnerable to reverse causation. They cannot calculate incidence or true risk, only prevalence. And they capture only people alive and accessible at the time of the survey, meaning severe or fatal cases are systematically missing. A cross-sectional study can tell you that two things are associated. It cannot tell you that one caused the other.

12.3 Case-Control Studies: Working Backward from the Outcome

Let’s say you want to understand what caused a cholera outbreak in Yemen—a country experiencing active conflict, making prospective research nearly impossible. You could wait months to set up a cohort study, or you could work backward: find people who have cholera (cases), find similar people who don’t have cholera (controls), and ask both groups about their recent exposures.

This is the logic of case-control studies: start with the outcome, look back at exposures.

THE BASIC DESIGN

In a case-control study, you select cases (people with the disease or outcome of interest) and controls (people without the disease, ideally drawn from the same source population), then measure past exposures in both groups and compare. The question is whether cases were more likely than controls to have been exposed.

Cases have the disease. Controls don’t. The question is: did cases have different exposures in their past?
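The comparison at the heart of a case-control study is usually summarized as an odds ratio from a 2x2 table. The sketch below computes it with an approximate 95% confidence interval using the standard log (Woolf) method. The counts are hypothetical, and the function name and argument order are choices made for this illustration.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls, z=1.96):
    """Unmatched odds ratio with a Woolf (log-based) confidence interval."""
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    estimate = (a * d) / (b * c)
    # Standard error of log(OR); requires all four cells to be non-zero.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Hypothetical study: 50 cases (30 exposed) and 50 controls (15 exposed).
or_, lo, hi = odds_ratio(30, 20, 15, 35)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

This formula applies to unmatched data. Matched designs need a different analysis, as discussed later in this section.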

In Yemen, researchers investigating cholera identified specific risk factors like not washing khat before chewing it, and use of common-source water (Dureab et al., 2019). This was a matched case-control study conducted in an active conflict zone—about as challenging a research environment as you can imagine. Yet by carefully matching controls to cases on geographic and demographic variables, researchers generated actionable findings about cholera transmission routes.

Case-control studies shine when the outcome is rare (you’d need enormous cohorts to capture enough cases), when you need answers quickly (no waiting years for outcomes to develop), when the outcome has already occurred (perfect for investigating outbreaks), or when multiple exposures might matter (you can assess many risk factors at once). But they introduce some distinctive challenges.

THE ART OF SELECTING CONTROLS

The hardest part of a case-control study is choosing controls. The fundamental principle: controls should represent the source population that gave rise to the cases. A useful test is to ask: if this control had developed the disease, would they have been identified as a case in your study?

Get this wrong and you introduce selection bias—one of the most pernicious threats to case-control studies.

Here’s a classic mistake: You’re studying risk factors for hospitalized pneumonia. You select pneumonia patients as cases. Then you select other hospitalized patients (say, surgery patients) as controls. Problem: hospitalized patients aren’t representative of the population that produced your pneumonia cases! If you’re studying a risk factor like smoking, and smoking also increases surgery risk, you’ll underestimate smoking’s effect on pneumonia.

The solution: controls should come from the same population (geographic area, health system, etc.) that cases came from, before they got sick. Often this means community controls rather than hospital controls.
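A few lines of arithmetic show how much the choice of control group can matter in the pneumonia example. The smoking prevalences below are invented for illustration: hospital controls are assumed to smoke more than the community because smoking also raises their chance of admission.

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1 - p)

def odds_ratio(p_exposed_cases: float, p_exposed_controls: float) -> float:
    """Odds ratio from the exposure prevalence in cases and in controls."""
    return odds(p_exposed_cases) / odds(p_exposed_controls)

# Hypothetical smoking prevalences.
p_smoke_cases = 0.40       # pneumonia cases
p_smoke_community = 0.20   # source population that produced the cases
p_smoke_hospital = 0.30    # other hospitalized patients (smoking raises admission risk)

or_community = odds_ratio(p_smoke_cases, p_smoke_community)
or_hospital = odds_ratio(p_smoke_cases, p_smoke_hospital)

print(f"OR with community controls: {or_community:.2f}")  # 2.67
print(f"OR with hospital controls:  {or_hospital:.2f}")   # 1.56
```

With these numbers, hospital controls pull the estimate toward the null even though nothing about the cases has changed.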

MATCHING: A DOUBLE-EDGED SWORD

Matching means selecting controls to be similar to cases on certain variables—age, sex, neighborhood, etc. The Yemen cholera study matched controls on geographic and demographic factors (Dureab et al., 2019). This wasn’t just methodological sophistication—in an active conflict zone, matching on neighborhood ensured that cases and controls faced similar water infrastructure, sanitation conditions, and access to health services.

Why match? To control confounding. If age is associated with both the exposure and the outcome, and you don’t match on age, you might see spurious associations.

But matching has a cost: you cannot assess whether the matched variables are risk factors themselves. If you match on age, you can no longer study whether age affects cholera risk. That information is locked away by design.

Individual matching: Each case gets one or more controls matched on specific characteristics (1:1, 1:2, or 1:n matching).

Frequency matching: Controls are selected so the group distribution matches cases (e.g., same age distribution) without pairing specific individuals.

Let’s compare the two main matching approaches:

Feature	Individual Matching	Frequency Matching
Control selection	Each case paired with specific control(s)	Controls selected to match overall distribution
Analysis complexity	Requires conditional logistic regression	Can use standard logistic regression
Efficiency	More powerful (tighter matching)	Less powerful but simpler
Flexibility	Hard to add more cases/controls later	Easy to adjust sample sizes
Best for	Small studies, rare diseases	Large studies, common diseases
Confounding control	Excellent for matched variables	Good but less precise

There’s also a temptation to over-match—to match on too many variables or, worse, to match on variables that are actually consequences of the exposure. This is one of the most common mistakes in case-control design, and it’s worth understanding why it’s so damaging.

Imagine you’re studying whether smoking causes lung cancer and you match on chronic cough (which smoking causes). What happens? Smokers with chronic cough get matched to non-smokers with chronic cough. But non-smokers with chronic cough probably have other lung problems that might also increase cancer risk. You’ve artificially made your smokers and non-smokers more similar on unmeasured lung disease severity, reducing the apparent association between smoking and cancer.

Here’s the rule: never match on anything in the causal pathway between exposure and outcome. Draw a DAG if you’re uncertain. Matching variables should be confounders (common causes of exposure and outcome), not mediators (consequences of exposure that lead to outcome) or colliders (consequences of both). Matching on a variable strongly correlated with the exposure itself causes the same kind of damage—you’ve made your exposed and unexposed groups artificially similar on something that should be different, and the apparent effect shrinks toward the null.

And here’s a crucial technical point: if you match, you must account for it in your analysis (using conditional logistic regression or similar methods). Reviews of published case-control studies show this is often done wrong (Cerceo et al., 2009). Matched design, unmatched analysis—a recipe for bias.

Why? Because matching breaks the independence between observations. Case-control pairs are linked by design. Standard logistic regression assumes all observations are independent. Use the wrong analysis and you’ll get incorrect standard errors, wrong p-values, and misleading confidence intervals.
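For the simplest case, 1:1 matching with a binary exposure, the conditional logistic regression estimate of the odds ratio reduces to the ratio of discordant pairs, which makes the contrast with an unmatched analysis easy to see. The pair counts below are hypothetical; in this invented example, ignoring the matching shifts the point estimate as well as the standard errors.

```python
# Hypothetical 1:1 matched pairs, classified by exposure of case and control.
both_exposed = 40          # case exposed, control exposed
case_only_exposed = 30     # case exposed, control unexposed (discordant)
control_only_exposed = 10  # case unexposed, control exposed (discordant)
neither_exposed = 20       # case unexposed, control unexposed

# Matched analysis: only discordant pairs carry information about the effect.
matched_or = case_only_exposed / control_only_exposed

# Unmatched analysis: break the pairs and pool everyone into one 2x2 table.
exposed_cases = both_exposed + case_only_exposed
unexposed_cases = control_only_exposed + neither_exposed
exposed_controls = both_exposed + control_only_exposed
unexposed_controls = case_only_exposed + neither_exposed
unmatched_or = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

print(f"Matched (conditional) OR: {matched_or:.2f}")  # 3.00
print(f"Unmatched (pooled) OR:    {unmatched_or:.2f}")  # 2.33
```

With more than one control per case, or with additional covariates, there is no shortcut like this and a conditional logistic regression routine is needed.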

RECALL BIAS AND OTHER THREATS

Because case-control studies ask people to remember past exposures, they’re vulnerable to recall bias. People with disease might search their memories more intensely or interpret past exposures differently than healthy controls.

Did you eat unwashed vegetables before you got sick? If you have cholera, you might remember every questionable salad. If you don’t have cholera, you might shrug and say, “probably?” This differential recall can create spurious associations.
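To see how differential recall alone can manufacture an association, consider a sketch in which the exposure truly has no effect: cases and controls were exposed equally often, but cases report their exposure more completely. The recall probabilities are assumptions chosen for illustration, and the sketch further assumes that nobody reports an exposure they did not have.

```python
def observed_odds_ratio(true_prev_cases, true_prev_controls,
                        recall_cases, recall_controls):
    """Odds ratio based on reported exposure under imperfect recall.

    recall_* is the probability that a truly exposed person reports the
    exposure. Unexposed people are assumed never to report it.
    """
    reported_cases = true_prev_cases * recall_cases
    reported_controls = true_prev_controls * recall_controls
    odds_cases = reported_cases / (1 - reported_cases)
    odds_controls = reported_controls / (1 - reported_controls)
    return odds_cases / odds_controls

# True exposure prevalence is 30% in both groups, so the true OR is 1.0.
equal_recall = observed_odds_ratio(0.30, 0.30, 0.90, 0.90)
differential = observed_odds_ratio(0.30, 0.30, 0.95, 0.70)

print(f"Same recall in both groups: OR = {equal_recall:.2f}")  # 1.00
print(f"Cases recall better:        OR = {differential:.2f}")  # 1.50
```

Equal under-reporting leaves the null result intact here. It is the difference in recall between groups that creates the spurious association.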

One strategy to minimize recall bias is to use objective exposure measures when possible—medical records, employment records, environmental measurements—rather than relying on participant memory. But in many global health settings, especially during emergencies, these records don’t exist. You work with what you have.

Case-control studies also face information bias when exposure assessment differs between cases and controls. If interviewers know who has the disease, they might probe more thoroughly for exposures in cases than in controls. That’s why blinding interviewers to case status—when possible—is good practice.

NESTED CASE-CONTROL STUDIES: GETTING THE BEST OF BOTH WORLDS

There’s a clever hybrid design worth knowing about: the nested case-control study, which starts with a cohort but analyzes it using case-control logic. You define a cohort and follow them prospectively (or use existing cohort data); when cases occur, you select controls from within the same cohort—people who haven’t developed the outcome yet; you measure exposures, often from stored biospecimens, for cases and matched controls; and you analyze the result as a case-control study.

The reason to bother is that this combines the strengths of both designs. You don’t need to measure expensive biomarkers on every cohort member, just on cases and selected controls, which keeps costs manageable when exposure assessment requires specialized lab tests or genomic sequencing. Exposures were measured before disease occurred, eliminating recall bias. And like a full cohort, you know exposure preceded outcome.

Nested case-control studies are particularly common in large prospective cohorts where investigators have collected and frozen blood samples. When a research question emerges years later, they can measure novel biomarkers in stored samples from cases and controls without the expense of assaying every cohort member. The key requirement: you must have an existing cohort with stored biological samples or detailed baseline data. You can’t nest a case-control study in a cohort that doesn’t exist yet.
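The control-selection step in a nested case-control study is often done by risk-set sampling: each time a case occurs, controls are drawn from cohort members who are still under follow-up and outcome-free at that moment. The sketch below illustrates the idea on a tiny invented cohort. The data structure, function name, and number of controls per case are assumptions, and a real study would typically also match on factors such as age or sample collection date.

```python
import random

def sample_risk_sets(cohort, controls_per_case=2, seed=0):
    """Risk-set sampling for a nested case-control study.

    cohort maps person id -> (follow_up_time, is_case), where follow_up_time
    is the event time for cases and the end of follow-up for everyone else.
    """
    rng = random.Random(seed)
    cases = sorted((t, pid) for pid, (t, is_case) in cohort.items() if is_case)
    matched_sets = []
    for event_time, case_id in cases:
        # Eligible controls: still followed and outcome-free at event_time.
        # Someone who becomes a case later is eligible until their own event.
        at_risk = [pid for pid, (t, _) in cohort.items()
                   if pid != case_id and t > event_time]
        k = min(controls_per_case, len(at_risk))
        matched_sets.append((case_id, rng.sample(at_risk, k)))
    return matched_sets

# Hypothetical cohort: times in years since enrollment.
cohort = {
    "p1": (2.0, True), "p2": (5.0, False), "p3": (3.5, True),
    "p4": (1.0, False), "p5": (6.0, False), "p6": (4.0, False),
}

matched_sets = sample_risk_sets(cohort)
for case_id, controls in matched_sets:
    print(case_id, "->", controls)
```

Only the people in these matched sets would need their stored samples assayed, which is where the cost savings come from.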

Stop and Ask Yourself

Think about the Yemen cholera study in an active conflict zone. What practical challenges might researchers have faced in selecting appropriate controls? How might those challenges introduce bias?

STRENGTHS AND LIMITATIONS

Case-control studies are the design of choice for rare diseases and outbreak investigations. They’re efficient (you start with cases that already exist), can assess multiple exposures simultaneously, and deliver results faster and cheaper than cohort studies. They’re particularly useful when the disease has a long latency period, because you don’t have to wait decades for outcomes to develop.

The costs are real. Case-control studies cannot calculate incidence or absolute risk because they have no denominator—you started with cases, not with a population at risk. They’re vulnerable to recall bias, because people with disease search their memories more intensely for explanations than healthy controls do. Control selection is difficult and easily biased; getting it wrong can invalidate the entire study. And for chronic exposures, the temporal sequence can be ambiguous even when the design is otherwise sound.

If you match cases to controls, you must account for the matching in the analysis, typically with conditional logistic regression. Standard (unconditional) logistic regression that ignores the matching can bias the odds ratio and gives wrong standard errors, wrong p-values, and misleading confidence intervals (Cerceo et al., 2009).
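For the simplest situation, 1:1 matching with a single binary exposure, the conditional estimate of the odds ratio reduces to the ratio of the two kinds of discordant pairs, so the idea can be shown without a modeling library. The sketch below uses invented pair counts and hypothetical function names. It compares that matched estimate with the crude odds ratio you would get by pooling everyone and ignoring who was paired with whom.

```python
# Hypothetical 1:1 matched data. Each tuple is one matched pair:
# (case_exposed, control_exposed), coded 1 = exposed, 0 = unexposed.
pairs = (
    [(1, 1)] * 30    # both exposed (concordant)
    + [(1, 0)] * 40  # only the case exposed (discordant)
    + [(0, 1)] * 10  # only the control exposed (discordant)
    + [(0, 0)] * 20  # neither exposed (concordant)
)


def matched_pair_odds_ratio(pairs):
    """Conditional OR for 1:1 matching: ratio of the two discordant counts."""
    case_only = sum(1 for case, ctrl in pairs if case == 1 and ctrl == 0)
    ctrl_only = sum(1 for case, ctrl in pairs if case == 0 and ctrl == 1)
    return case_only / ctrl_only


def crude_odds_ratio(pairs):
    """Unmatched OR from the pooled 2x2 table, ignoring who was paired."""
    exposed_cases = sum(case for case, _ in pairs)
    unexposed_cases = len(pairs) - exposed_cases
    exposed_ctrls = sum(ctrl for _, ctrl in pairs)
    unexposed_ctrls = len(pairs) - exposed_ctrls
    return (exposed_cases * unexposed_ctrls) / (unexposed_cases * exposed_ctrls)


print(matched_pair_odds_ratio(pairs))  # 4.0
print(crude_odds_ratio(pairs))         # 3.5
```

In these made-up data the analysis that ignores the matching gives an estimate closer to the null than the matched analysis does. With more than one control per case or with additional covariates, you would fit a conditional logistic regression model in statistical software rather than count pairs by hand.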

12.4 Cohort Studies: Following People Forward in Time

Now let’s flip the case-control logic on its head. Instead of starting with disease and looking backward to exposures, we start with exposures and follow people forward to see who develops disease.

This is the cohort study: you identify a group of people who share some characteristic (or vary on an exposure of interest), and you follow them over time to see what happens.

DESIGN FUNDAMENTALS: PROSPECTIVE VS. RETROSPECTIVE

There are two main flavors of cohort studies, distinguished by when you start following people relative to when you start the study.

Prospective cohort studies define the cohort in the present and follow them into the future. You measure exposures now, then wait months or years to see who develops the outcome. This is the classic cohort design. You’re there from the beginning, watching the story unfold in real time.

Retrospective cohort studies (also called historical cohort studies) use existing records to reconstruct what happened in the past. You identify a cohort and their exposures using historical data, then follow their outcomes through records up to the present. Think of it like reading a history book—the events already happened, but you’re piecing together the narrative from documented evidence.

Prospective cohort: Define cohort now \(\rightarrow\) follow forward \(\rightarrow\) measure outcomes later

Retrospective cohort: Use past records \(\rightarrow\) reconstruct exposures \(\rightarrow\) trace outcomes through present

Let me show you what retrospective cohort studies look like in practice, because they’re remarkably powerful tools for resource-limited settings where waiting years for prospective data isn’t feasible.

A retrospective cohort study in KwaZulu-Natal, South Africa, examined treatment outcomes among individuals on antiretroviral therapy using medical records (Muzumbukilwa et al., 2024). The researchers didn’t need to wait years for prospective follow-up—they reconstructed the cohort’s history from existing clinical data, tracking patients from ART initiation through their treatment journey.

What they found should give pause to anyone assuming ART programs automatically deliver optimal outcomes. Virologic suppression rates fell below UNAIDS targets, and mortality burdens remained substantial even among those on treatment. These weren’t just statistics—they represented real gaps between program intent and patient outcomes, gaps that might have remained hidden without systematic cohort analysis.

The beauty of the retrospective approach here is speed. Instead of launching a new prospective cohort and waiting a decade for answers, researchers could analyze ten years of outcomes in the time it took to extract and clean medical records. But the limitation is stark: they were constrained to whatever data clinicians had originally recorded. If a potential risk factor wasn’t documented in patient files, it couldn’t be studied—no matter how important it might be.

Similarly, a ten-year retrospective cohort in Ethiopia examined opportunistic infections among 515 people initiating antiretroviral therapy (Woldegeorgis et al., 2022). Using hospital records, researchers could analyze a decade of disease progression and treatment outcomes without waiting a decade to conduct the study. The corrected sample size—515, not 537 as sometimes incorrectly reported—demonstrates the importance of careful record validation when working with historical data.

Here’s the fundamental trade-off with retrospective cohorts: speed and efficiency versus data completeness and quality. You gain years of time, but you sacrifice control over what gets measured and how accurately it’s recorded.

Pro tip: When reading retrospective cohort studies, always check what percentage of eligible medical records were excluded due to missing data. High exclusion rates suggest the findings might not generalize to all patients.

Let’s break down the key differences between prospective and retrospective cohort designs:

Feature	Prospective Cohort	Retrospective Cohort
Timeline	Define cohort \(\rightarrow\) wait \(\rightarrow\) measure outcomes	Use past records \(\rightarrow\) reconstruct history
Duration	Years to decades	Weeks to months (for analysis)
Cost	Very expensive	Relatively inexpensive
Data quality	High (you control it)	Variable (depends on records)
Exposure measurement	Precise, standardized	Limited to what was recorded
Bias risk	Loss to follow-up	Selection bias, information bias
Outcome ascertainment	Complete, standardized	Depends on record quality
Best for	Rare exposures, precise measurement needed	Long-term outcomes in existing records, quick answers needed

Stop and Ask Yourself

Why might a retrospective cohort study be particularly useful for studying HIV treatment outcomes in resource-limited settings? Think about the practical constraints on launching new prospective studies in these contexts.

The answer involves both scientific and logistical considerations. First, you need a decade of follow-up to properly assess long-term ART outcomes—mortality, treatment failure, virologic suppression. Waiting another decade for prospective data means today’s patients continue receiving sub-optimal care while researchers wait for results. Second, many resource-limited settings now have electronic medical record systems with years of accumulated patient data. Those data represent a goldmine for understanding what’s actually happening in real-world treatment programs, not idealized trial settings. Third—and this matters—retrospective studies don’t require continued patient contact, circumventing challenges with high mobility and loss to follow-up that plague prospective cohorts in these settings.

ADVANTAGES OVER CASE-CONTROL

Cohort studies offer several advantages when feasible. Because you measure exposure before the outcome occurs (or verify this through records), you establish the temporal sequence that case-control studies struggle with. You can calculate true incidence rates—how often disease occurs in exposed vs. unexposed groups—because you have both numerators (cases) and denominators (person-time at risk). You can study multiple outcomes from a single exposure: a cohort of smokers can examine lung cancer, heart disease, stroke, and COPD simultaneously. And because exposures are measured in the present or from objective records, cohort studies are less vulnerable to recall bias than case-control designs.
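The numerator and denominator logic is easy to see with a small calculation. The sketch below uses invented cohort totals (the counts and the `incidence_rate` helper are assumptions for illustration, not data from any study cited here) to compute incidence rates per 1,000 person-years, then the rate ratio and rate difference.

```python
def incidence_rate(cases, person_years, per=1000):
    """New cases per `per` person-years at risk."""
    return cases / person_years * per


# Hypothetical cohort totals (invented numbers).
exposed = {"cases": 30, "person_years": 2000}
unexposed = {"cases": 20, "person_years": 4000}

rate_exposed = incidence_rate(**exposed)
rate_unexposed = incidence_rate(**unexposed)

rate_ratio = rate_exposed / rate_unexposed        # relative measure
rate_difference = rate_exposed - rate_unexposed   # absolute measure

print(f"Exposed:   {rate_exposed:.1f} per 1,000 person-years")
print(f"Unexposed: {rate_unexposed:.1f} per 1,000 person-years")
print(f"Rate ratio: {rate_ratio:.1f}")
print(f"Rate difference: {rate_difference:.1f} per 1,000 person-years")
```

A case-control study could estimate something like the ratio, but not the two rates or their difference, because it never observes the person-time denominators.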

These advantages come at a cost. Cohort studies are expensive, time-consuming (especially prospective ones), and vulnerable to losses that case-control studies sidestep.

THREATS TO VALIDITY: WHAT CAN GO WRONG AND HOW TO SPOT IT

The central threat to cohort studies is loss to follow-up. People move, drop out, die of other causes, or simply stop participating. If loss to follow-up is related to both the exposure and outcome, you’ve got a problem.

Imagine you’re following a cohort to study whether a new antiretroviral regimen reduces mortality compared to standard treatment. If people on the new regimen who start feeling sick are more likely to drop out and seek care elsewhere, you’ll underestimate mortality in that group. Your comparison is now biased because you’re missing precisely the people who matter most.

Loss to follow-up (LTFU) becomes a serious threat to validity when it’s differential—meaning dropout rates differ between exposure groups or between those who develop the outcome and those who don’t.

The South Africa ART cohort faced this challenge (Muzumbukilwa et al., 2024). In resource-limited settings with high mobility and fragmented health systems, maintaining contact with participants over years is genuinely hard. Researchers need to account for loss to follow-up in their analyses and, ideally, investigate whether people who drop out differ systematically from those who remain.

Here’s the uncomfortable truth about loss to follow-up: you can’t know for certain what happened to people you lost contact with. Maybe they died. Maybe they moved and are perfectly healthy. Maybe they switched to a different health facility. Each possibility changes your interpretation of results, but you’re making educated guesses.

Researchers handle LTFU in several ways, none perfect. Complete case analysis simply excludes lost participants—simple, but biased if dropout is differential. Sensitivity analyses re-run the analysis under different assumptions about what happened to lost participants (best case: none developed the outcome; worst case: all did). Statistical methods like multiple imputation or inverse probability weighting can account for LTFU if you’ve measured good predictors of dropout, but they can’t help with predictors you didn’t measure.
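A best-case and worst-case sensitivity analysis is simple enough to sketch directly. The numbers below are invented, and the scenario names and helper functions are hypothetical. The point is only to show how much the risk ratio can move when the lost participants are treated differently.

```python
# Hypothetical cohort (invented numbers): enrolled, observed events,
# and participants lost to follow-up in each exposure group.
groups = {
    "exposed": {"enrolled": 500, "events": 40, "lost": 50},
    "unexposed": {"enrolled": 500, "events": 60, "lost": 20},
}


def risk(group, scenario):
    """Cumulative risk under one assumption about the lost participants."""
    n, events, lost = group["enrolled"], group["events"], group["lost"]
    if scenario == "complete_case":  # drop the lost participants entirely
        return events / (n - lost)
    if scenario == "best_case":      # none of the lost had the outcome
        return events / n
    if scenario == "worst_case":     # all of the lost had the outcome
        return (events + lost) / n
    raise ValueError(f"unknown scenario: {scenario}")


def risk_ratio(groups, scenario):
    """Risk in the exposed divided by risk in the unexposed."""
    return risk(groups["exposed"], scenario) / risk(groups["unexposed"], scenario)


for scenario in ("complete_case", "best_case", "worst_case"):
    print(f"{scenario}: RR = {risk_ratio(groups, scenario):.2f}")
```

In this toy example the exposure looks protective under the complete-case and best-case assumptions, but the risk ratio rises above 1 under the worst-case assumption, because more exposed participants were lost. When conclusions flip like this across plausible scenarios, the loss to follow-up is large enough to matter.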

Conventional guidance in clinical epidemiology often uses simple thresholds for loss to follow-up. As summarized by David L. Sackett and colleagues in Clinical Epidemiology: How to Do Clinical Practice Research (Haynes et al., 2015), losses of <5% are typically considered unlikely to threaten validity, whereas losses >20% raise serious concern about bias. These heuristics are useful for quick appraisal, but they are not sufficient. As emphasized in Modern Epidemiology (Lash et al., 2021), the critical issue is not the magnitude of loss per se but whether it introduces selection bias—whether those lost differ systematically from those retained in ways related to both the exposure and the outcome. Even relatively small amounts of loss can bias estimates if missingness is informative, while larger losses may be less problematic if they are random or appropriately addressed analytically. Rigorous evaluation focuses on the mechanism and direction of missingness, not just its percentage.

Changes in exposure status present another subtle challenge that’s often overlooked. In a cohort studying dietary patterns and diabetes risk, people might change their diets during the follow-up period. Do you analyze them based on their baseline diet (which no longer reflects reality) or their current diet (which might have changed because they developed pre-diabetes or received counseling after a concerning blood test)? There’s no perfect answer—each approach has trade-offs.

This is the problem of time-varying exposures, and it’s everywhere in global health research. People start or stop smoking. Dietary patterns change with seasons or economic shocks. Medication adherence waxes and wanes. Bed net use varies across seasons. The classic approach, measuring exposure once at baseline, treats exposure as static when it’s actually dynamic. The more sophisticated approach, measuring exposure repeatedly over time, lets you account for changes, but now you face the challenge of time-varying confounding. If health status affects both later exposure and the outcome, and health status is itself shaped by earlier exposure, standard regression adjustment can give biased answers: adjusting for health status blocks part of the exposure’s effect, while leaving it out leaves the confounding in place. You need specialized statistical methods (like marginal structural models or g-estimation) that are beyond the scope of this chapter but important to know exist.

Stop and Ask Yourself

Think about a cohort study following people’s physical activity levels and cardiovascular disease risk over 10 years. Why might measuring physical activity only at baseline be problematic? What could change over that decade?

The answer should make you uncomfortable. People’s activity levels change dramatically over 10 years: they age, develop joint problems, have children, change jobs, move to new neighborhoods. Someone highly active at baseline might become sedentary by year 5. Someone sedentary at baseline might join a running club at year 3. If you only measure at baseline, you’re essentially asking: “Does physical activity at one moment in time predict CVD risk a decade later?” That’s a very different question from “Does sustained physical activity over time reduce CVD risk?”—which is probably what you actually care about.

Confounding remains a threat in cohort studies, just as in all observational designs. Yes, you’ve established temporal sequence—exposure came before outcome—but that doesn’t mean other factors aren’t driving the association. We’ll return to this in a moment when we discuss confounding control strategies across all observational designs.

WHEN TO CONSIDER THIS DESIGN

Cohort studies are the natural choice when you need to establish that exposure preceded outcome—the cornerstone of causal inference that cross-sectional designs can’t provide. They’re also the only observational design that calculates true incidence rates: how often new cases occur among exposed vs. unexposed groups. If risk estimation is your goal, you need a cohort.

The design is particularly efficient when the exposure is rare (it’s easier to recruit exposed individuals and follow them forward than to wait for rare cases to appear) or when multiple outcomes are of interest (a cohort of smokers can examine lung cancer, heart disease, stroke, and COPD from a single recruitment effort). And when electronic medical records, employment databases, or disease registries exist with years of accumulated data, retrospective cohorts can deliver decades of follow-up in months of analysis time.

HOW TO STRENGTHEN THE DESIGN

The most important investment is minimizing loss to follow-up. Use multiple contact methods, maintain relationships with participants, track reasons for dropout, and compare baseline characteristics of those who remain with those who leave. You can only adjust for confounders you measured, so identify potential confounders from prior literature and causal diagrams before data collection begins—not after.

When exposures might change over time, plan for repeated measurement. A single baseline assessment of diet, medication adherence, or physical activity treats a dynamic exposure as static, and the resulting estimates answer a different question than the one you care about. Use objective outcome ascertainment when possible—blind assessors to exposure status, use standardized diagnostic criteria, and link to registries or administrative data. Pre-register your analysis plan to avoid selective reporting, and plan sensitivity analyses that address assumptions about LTFU, unmeasured confounding, and measurement error.

Establishing temporal sequence doesn’t eliminate confounding. Knowing that exposure preceded outcome still leaves open the possibility that unmeasured common causes drive both.

STRENGTHS AND LIMITATIONS

Cohort studies are the strongest observational design for causal inference. They establish temporal sequence, calculate true incidence and risk measures, can study multiple outcomes from a single exposure, and are less vulnerable to recall bias than case-control designs because exposures are measured before disease occurs.

The costs are substantial. Prospective cohorts are expensive and time-consuming, often requiring years of follow-up and large teams to maintain. Loss to follow-up is a constant threat, and if it’s differential—related to both exposure and outcome—it biases results in ways that are difficult to detect. Cohort studies are inefficient for rare outcomes (you’d need enormous samples to observe enough cases). Time-varying exposures complicate the analysis. And retrospective cohorts, while faster, are limited to whatever data clinicians or administrators originally chose to record.

12.5 Confounding Control Across Designs

Let’s circle back to the central challenge in all observational research: confounding. Remember from the causal inference chapter that a confounder is a variable associated with both the exposure and the outcome, creating a spurious association between them.

Each observational design attempts to address confounding, but their strategies differ.

DESIGN-PHASE STRATEGIES

Cross-sectional studies offer limited confounding control by design. You can stratify your sample to ensure you have variation across potential confounders, but fundamentally you’re measuring everything simultaneously, making it hard to disentangle relationships.

Case-control studies use matching to control confounding at the design phase. By ensuring cases and controls are similar on potential confounders (age, sex, geography), you eliminate those variables as alternative explanations. But remember: matched variables can’t be studied as risk factors, and you must account for matching in your analysis.

Cohort studies can use restriction (only including people within a narrow range of the confounder—say, only women aged 40-45) or matching (selecting unexposed individuals to match exposed ones on potential confounders). Both strategies limit who enters your cohort to reduce confounding.

ANALYSIS-PHASE STRATEGIES

All observational designs can address confounding through statistical adjustment in the analysis phase.

Stratification means analyzing the exposure-outcome relationship separately within levels of the confounder. A study of air pollution and respiratory health in Delhi used this approach, examining associations separately for different sociodemographic groups (Siddique et al., 2011). If the association holds across all strata, confounding by that variable is less likely.

Let’s look at the Delhi example more carefully, because it illustrates how confounding control works in practice. Researchers wanted to understand whether outdoor air pollution affected children’s respiratory health. But here’s the problem: children from different socioeconomic backgrounds have different pollution exposures and different baseline health risks. Wealthier families might live in less polluted neighborhoods and have better access to healthcare. Poorer families might experience both higher pollution and more crowding, indoor smoke exposure, and nutritional deficits—all of which affect respiratory health.

The study team matched participants on multiple sociodemographic factors before the analysis even began: age, sex, parental education, household income, and neighborhood characteristics (Siddique et al., 2011). This design-phase matching ensured that comparisons between high- and low-pollution exposure groups wouldn’t be confounded by these background variables.

But matching alone wasn’t enough. In the analysis phase, they used multivariable logistic regression to further adjust for potential confounders. This allowed them to estimate the association between pollution and respiratory symptoms while statistically “holding constant” the other measured factors.

Think of it this way: stratification asks, “Is the pollution-health association present within each stratum of potential confounders?” If yes, confounding by that variable is unlikely. Multivariable regression asks, “What is the pollution-health association after accounting for all measured confounders simultaneously?” Both approaches aim to isolate pollution’s effect from other influences.
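The stratification question can be made concrete with a toy calculation. The counts below are invented (they are not the Delhi data) and are deliberately built so that socioeconomic status fully explains the crude association. The sketch computes the crude odds ratio, the stratum-specific odds ratios, and a Mantel-Haenszel pooled odds ratio, which is one standard way to summarize across strata and is introduced here as an illustration, not as the method of the cited study.

```python
# Hypothetical 2x2 tables by stratum of a confounder (invented counts).
# Each table: (exposed cases, exposed non-cases,
#              unexposed cases, unexposed non-cases)
strata = {
    "higher SES": (10, 90, 40, 360),
    "lower SES": (120, 280, 30, 70),
}


def odds_ratio(a, b, c, d):
    """OR from one 2x2 table."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Collapse the strata into one table, ignoring the confounder."""
    a, b, c, d = (sum(cells) for cells in zip(*strata.values()))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Pooled OR that compares exposed and unexposed within strata only."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
    return numerator / denominator


print(f"Crude OR: {crude_odds_ratio(strata):.2f}")
for name, table in strata.items():
    print(f"{name}: OR = {odds_ratio(*table):.2f}")
print(f"Mantel-Haenszel OR: {mantel_haenszel_odds_ratio(strata):.2f}")
```

Here the crude odds ratio is a little above 2, yet the odds ratio is 1.0 within each stratum and 1.0 after pooling. The crude association arises entirely because lower-SES children are both more exposed and more likely to be cases.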

Multivariable regression allows you to adjust for multiple confounders simultaneously. The Delhi study used multivariable logistic regression to control for age, sex, education, and other factors while estimating pollution’s effect on respiratory outcomes (Siddique et al., 2011).

Multivariable vs. multivariate: Multivariable regression means multiple predictor variables (which is what we usually want). Multivariate regression means multiple outcome variables. These terms are often confused!

Let’s compare the main strategies for controlling confounding:

Strategy	Timing	How It Works	Advantages	Limitations	Best When
Randomization	Design	Randomly assign exposure, distributing confounders equally across groups	Controls measured and unmeasured confounders	Only possible in experiments	You can control who gets exposed
Restriction	Design	Only include people within narrow range of confounder (e.g., only men, ages 40-45)	Simple, transparent, eliminates confounding by restricted variables	Reduces generalizability, can’t assess effect modification	Few key confounders, homogeneous population acceptable
Matching	Design	Select comparison group to match exposed group on confounders	Controls for matched variables, efficient	Can’t study matched variables as risk factors, matched analysis required	Small studies, clear confounders known in advance
Stratification	Analysis	Analyze exposure-outcome within levels of confounder	Transparent, easy to explain and interpret	Limited to few confounders, small strata have low power	Few confounders, want to show effect within subgroups
Multivariable Adjustment	Analysis	Statistical model adjusts for multiple confounders simultaneously	Handles many confounders, standard approach	Model assumptions matter, only controls measured confounders	Many potential confounders, large sample size
Propensity Scores	Analysis	Estimate probability of exposure, match/stratify/weight on this score	Reduces many confounders to single score, mimics randomization	Complex, still only controls measured confounders	Very many confounders, imbalanced groups

Propensity score methods represent a more sophisticated approach. Instead of adjusting for a dozen confounders separately, you estimate each person’s propensity (probability) to be exposed based on their baseline characteristics, then match, stratify, or weight based on that single score (Assimon, 2021). This can balance measured confounders across exposure groups, much as randomization would, but unlike randomization it does nothing about confounders you didn’t measure.

Here’s how it works: First, you build a statistical model predicting exposure (not outcome!) based on all measured confounders. This gives each person a propensity score—their probability of being exposed given their characteristics. Then you use these scores to make exposed and unexposed groups comparable. You might match people with similar propensity scores, stratify by propensity score quintiles, or weight observations by the inverse of their propensity to be in their observed group.

Why go through all this? When you have many potential confounders, traditional multivariable regression can become unstable, especially with modest sample sizes. Propensity scores condense all that information into a single number, making matching and balance assessment more straightforward. They’re particularly useful in observational studies trying to mimic randomized trials (Assimon, 2021).

ADVANCED METHODS FOR CAUSAL INFERENCE FROM OBSERVATIONAL DATA

As observational research has matured, so have methods for extracting causal estimates from non-randomized data. These approaches aim to make observational studies approximate the conditions of randomized experiments—what methodologists call target trial emulation. While no statistical method can fully substitute for randomization (unmeasured confounding remains a fundamental limitation), these techniques represent our best tools for credible causal inference when experiments aren’t possible.

Inverse Probability of Treatment Weighting (IPTW)

Inverse probability of treatment weighting takes the propensity score concept further. Instead of matching or stratifying on propensity scores, IPTW creates a pseudo-population where exposure is independent of measured confounders—essentially reweighting the sample to mimic random assignment.

The intuition: if a person has a low probability of being exposed (say, 20%) but was actually exposed, they are “unusual” among exposed individuals. We upweight their contribution so they represent not just themselves but all the similar people who weren’t exposed. Conversely, someone with an 80% probability of exposure who was exposed is “typical” and gets less weight. The weight for each individual is the inverse of their probability of receiving the treatment they actually received: \(1 / P(\text{exposed} \mid \text{confounders})\) for exposed individuals, and \(1 / [1 - P(\text{exposed} \mid \text{confounders})]\) for unexposed individuals.

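Stabilized weights multiply each weight by the marginal probability of the exposure the person actually received (the overall proportion exposed for exposed individuals, the overall proportion unexposed for unexposed individuals). This shrinks extreme weights and keeps the pseudo-population roughly the same size as the original sample, which often improves precision without introducing bias.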
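The weighting arithmetic can be sketched with a toy cohort in which a single binary confounder drives exposure, so the propensity score is just the proportion exposed within each level of that confounder (20% and 80%, echoing the intuition above). All counts and function names are invented for illustration. Stabilized weights are computed using the common definition \(P(\text{exposure received}) / P(\text{exposure received} \mid \text{confounders})\).

```python
# Hypothetical cohort summarized as
# (confounder level, exposed 1/0, number of people, number of events).
groups = [
    (0, 1, 100, 10),
    (0, 0, 400, 40),
    (1, 1, 400, 120),
    (1, 0, 100, 30),
]


def propensity_by_level(groups):
    """P(exposed | confounder) as the exposed share within each level."""
    total, exposed = {}, {}
    for level, a, n, _ in groups:
        total[level] = total.get(level, 0) + n
        exposed[level] = exposed.get(level, 0) + (n if a == 1 else 0)
    return {level: exposed[level] / total[level] for level in total}


def iptw_risks(groups, stabilized=False):
    """Weighted risk in each exposure arm and the pseudo-population size."""
    ps = propensity_by_level(groups)
    n_total = sum(n for _, _, n, _ in groups)
    p_exposed = sum(n for _, a, n, _ in groups if a == 1) / n_total
    weight_sum = {0: 0.0, 1: 0.0}
    event_sum = {0: 0.0, 1: 0.0}
    for level, a, n, events in groups:
        # Probability of the exposure this group actually received.
        p_received = ps[level] if a == 1 else 1 - ps[level]
        w = 1 / p_received
        if stabilized:
            w *= p_exposed if a == 1 else 1 - p_exposed
        weight_sum[a] += w * n
        event_sum[a] += w * events
    risks = {a: event_sum[a] / weight_sum[a] for a in (0, 1)}
    return risks, weight_sum[0] + weight_sum[1]


risks, pseudo_n = iptw_risks(groups)
stab_risks, stab_pseudo_n = iptw_risks(groups, stabilized=True)
print(risks, pseudo_n)
print(stab_risks, stab_pseudo_n)
```

In this toy data the crude risks are 0.26 among the exposed and 0.14 among the unexposed, but the weighted risks are both about 0.20, because exposure makes no difference within either level of the confounder. The unstabilized weights build a pseudo-population of about 2,000 from 1,000 people, while the stabilized weights bring it back to about 1,000 without changing the weighted risks.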

IPTW has several advantages over propensity score matching. It uses every observation rather than discarding unmatched individuals, handles time-varying exposures well, and estimates population-level causal effects rather than effects within a matched subset. The trade-off is sensitivity to extreme weights. If some individuals have very low propensity scores but were actually exposed (or very high scores but weren’t), their weights become enormous and estimates become unstable. Always check the weight distribution: even a handful of individuals with weights of 50 or 100 is a warning sign.
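A weight check can be as short as the sketch below. The propensity scores and exposure indicators are invented, the cutoff of 50 is only an illustrative threshold, and the effective sample size uses Kish’s approximation, which is a diagnostic added here and not one the chapter prescribes.

```python
def iptw_weights(propensity, exposed):
    """Unstabilized weights: 1 / P(exposure actually received)."""
    return [1 / p if a == 1 else 1 / (1 - p) for p, a in zip(propensity, exposed)]


def effective_sample_size(weights):
    """Kish's approximation: (sum of w)^2 / sum of w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical propensity scores and observed exposure (1 = exposed).
propensity = [0.5, 0.4, 0.6, 0.01, 0.7, 0.3, 0.9, 0.5]
exposed = [1, 0, 1, 1, 0, 0, 1, 0]

weights = iptw_weights(propensity, exposed)
flagged = [i for i, w in enumerate(weights) if w > 50]  # illustrative cutoff

print(f"Largest weight: {max(weights):.1f}")
print(f"Records above cutoff: {flagged}")
print(f"Effective sample size: {effective_sample_size(weights):.2f} of {len(weights)}")
```

One person with a 1% chance of exposure who was nonetheless exposed ends up with a weight near 100, and the effective sample size collapses to fewer than two people out of eight. In practice that is a signal to revisit the propensity model, consider stabilized or truncated weights, or restrict the analysis to people for whom both exposure levels were realistic possibilities.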

In global health, IPTW is increasingly used to evaluate treatment effects in routine care: comparing mortality between patients started on different first-line ART regimens in clinical cohorts, or assessing the effectiveness of bednet distribution programs using household survey data.

Targeted Learning and TMLE

Targeted Maximum Likelihood Estimation (TMLE) combines machine learning with causal inference frameworks. Developed by Mark van der Laan and colleagues at Berkeley, TMLE addresses limitations of both regression and propensity score methods.

The key insight is that traditional methods optimize for the wrong thing. Standard regression minimizes prediction error across the entire outcome distribution, but we only care about the specific causal estimand—the average treatment effect, say. TMLE is “targeted” because it optimizes specifically for the parameter we want to estimate. It works in three stages: an initial outcome model estimates the relationship between exposure, confounders, and outcome using any prediction method, including machine learning algorithms like random forests or gradient boosting; a propensity score model estimates the probability of exposure given confounders; and a targeting step then updates the initial outcome model using a covariate derived from the propensity score, optimizing for the causal effect estimate rather than for prediction.

Super Learner is often used with TMLE to build both the outcome and propensity models. It combines multiple machine learning algorithms through cross-validation, selecting the optimal weighted combination. This reduces dependence on any single modeling assumption.

Why bother with the complexity? TMLE has three properties simpler methods lack. It is doubly robust—estimates remain consistent if either the outcome model or the propensity model is correctly specified, where traditional regression requires the outcome model to be right and IPTW requires the propensity model to be right. It is compatible with machine learning, handling situations where you don’t know the functional form relating confounders to outcome while still providing valid confidence intervals and p-values. And among doubly robust estimators, it is efficient: it extracts as much information from the data as possible given what we’re willing to assume.

The downside is complexity. Implementation requires specialized software (the tmle and ltmle packages in R, or zEpid in Python), and interpretation requires understanding causal inference concepts. But for high-stakes analyses where credibility matters, the investment is worthwhile. TMLE has been used to estimate effects of vaccination programs on disease incidence, evaluate the impact of food assistance on nutrition outcomes, and assess community health worker interventions using observational data from implementation settings.

Other Causal Inference Methods

Several other methods strengthen causal inference from observational data by exploiting natural experiments or policy variation. Instrumental variables use external factors that affect exposure but not the outcome directly. Regression discontinuity exploits sharp eligibility cutoffs. Difference-in-differences compares changes over time between treated and untreated groups. These approaches are covered in Chapter 11, as they occupy a middle ground between pure observation and true randomization. When you have access to a valid instrument, a sharp policy cutoff, or a natural experiment with a comparison group, these designs can provide stronger causal evidence than propensity-based methods alone.

Choosing Among Methods

Method	Best When	Key Assumptions	Limitations
Multivariable regression	Few confounders, known functional form	Correct model specification	Model-dependent, unmeasured confounding
Propensity score matching	Many confounders, want intuitive balance	Correct propensity model, no unmeasured confounding	Discards unmatched observations
IPTW	Many confounders, time-varying exposures	Correct propensity model, no unmeasured confounding	Sensitive to extreme weights
TMLE	Complex confounding, want robust inference	Either outcome or propensity model correct, no unmeasured confounding	Complexity, requires specialized software

For methods that exploit natural experiments (IV, RD, DiD), see Chapter 11.

The honest answer is that no purely observational method eliminates unmeasured confounding. The best approach combines thoughtful study design (measuring important confounders) with appropriate methods and honest acknowledgment of limitations. Sensitivity analyses—asking “how strong would unmeasured confounding need to be to explain away our findings?”—help quantify robustness.
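One widely used way to put a number on that question is the E-value proposed by VanderWeele and Ding, which the chapter does not otherwise cover and which is offered here only as an example of such a sensitivity analysis. For an observed risk ratio \(RR \geq 1\), the E-value is \(RR + \sqrt{RR \times (RR - 1)}\); for a protective estimate, the risk ratio is inverted first. The function name below is hypothetical.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate on the risk ratio scale."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))


for observed in (1.2, 2.0, 0.5):
    print(f"Observed RR {observed}: E-value = {e_value(observed):.2f}")
```

The result is read as the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the exposure and the outcome, beyond the measured covariates, to fully explain away the observed estimate. An observed risk ratio of 2.0 gives an E-value of about 3.4, whereas a risk ratio of 1.2 gives about 1.7, so weaker associations are easier to explain away.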

Statistical adjustment only controls for confounders you’ve measured. If there’s an unmeasured variable associated with both exposure and outcome, no amount of fancy statistics will save you. This is why observational studies, no matter how well designed, can never prove causation the way randomized trials can. They can provide strong evidence, compelling evidence, actionable evidence. But there’s always the possibility of unmeasured confounding.

Stop and Ask Yourself

Why can’t statistical adjustment control for unmeasured confounders? Think about what you’d need to know to adjust for a variable.

The answer is straightforward: to adjust for a variable, you need data on that variable. If household cooking fuel type is a confounder but you didn’t measure it, you can’t include it in your model. It doesn’t matter how sophisticated your statistical methods are—they can only work with the data you have. This is why careful thinking about which variables to measure during study design is just as important as choosing the right analysis strategy.

12.6 Choosing the Right Observational Design

You have a research question. You have determined that a randomized trial isn’t feasible. How do you choose among the observational options?

Start with the kind of question you are asking. Descriptive questions about how common something is, or what patterns exist, point toward case reports, surveillance, and cross-sectional designs. Questions about associations between exposures and outcomes can be tackled with cross-sectional, case-control, or cohort studies. Causal questions push you toward case-control or cohort designs with careful attention to confounding—and even then, the evidence is weaker than a randomized trial or quasi-experiment would provide.

Then consider the outcome. Rare outcomes push you toward case-control studies, because you would need enormous cohorts to capture enough cases prospectively. Common outcomes can be studied efficiently with cohort or cross-sectional designs.
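
The arithmetic behind "enormous cohorts" is simple enough to sketch. The function below is a back-of-envelope calculation only: it assumes a constant incidence rate, no loss to follow-up, and no competing risks, and the rates used are invented for illustration.

```python
import math


def cohort_size_needed(target_cases: int, rate_per_100k_person_years: float, years: float) -> int:
    """Approximate cohort size needed to observe a target number of cases."""
    if rate_per_100k_person_years <= 0 or years <= 0:
        raise ValueError("rate and follow-up time must be positive")
    # Expected cases = N * (rate / 100,000) * years, solved for N.
    return math.ceil(target_cases * 100_000 / (rate_per_100k_person_years * years))


# Hypothetical rare outcome: 10 cases per 100,000 person-years.
rare = cohort_size_needed(target_cases=100, rate_per_100k_person_years=10, years=5)
# Hypothetical common outcome: 5,000 cases per 100,000 person-years.
common = cohort_size_needed(target_cases=100, rate_per_100k_person_years=5_000, years=5)
print(f"Rare outcome: {rare:,} participants; common outcome: {common:,} participants")
```

Under these assumptions, collecting 100 cases of the rare outcome over five years takes a cohort of 200,000 people, against 400 for the common one. A case-control study sidesteps the problem by starting from the 100 cases directly.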

Timing matters next. If you need answers in weeks, case-control or cross-sectional designs are the realistic options. If you can wait years, a prospective cohort gives you the cleanest temporal sequence. If you want long-term outcomes without the wait, a retrospective cohort can deliver decades of follow-up in months of chart review—provided the records exist and capture what you need.

Resources and exposure characteristics fill in the rest. Limited budgets favor cross-sectional and case-control designs over prospective cohorts that demand years of follow-up infrastructure. Past exposures with rare outcomes point to case-control. Current exposures you want to follow forward point to prospective cohort. Exposures that vary over time call for longitudinal or panel designs that can capture the variation rather than treating it as static.

These factors don’t always converge on a single answer, and they shouldn’t. The point is to make the trade-offs explicit, so the design you end up with is a choice rather than a default.
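
For readers who think in code, the factors above can be written down as a rough filter. This is one possible encoding of the heuristics in this section, not a validated decision rule: the function name, the argument names, and the exact exclusion logic are assumptions, and real choices involve judgment that a few booleans cannot capture.

```python
def suggest_designs(question: str, rare_outcome: bool,
                    need_answers_fast: bool, records_exist: bool) -> list[str]:
    """Return candidate observational designs given four coarse inputs.

    question is one of "descriptive", "association", or "causal".
    """
    if question not in {"descriptive", "association", "causal"}:
        raise ValueError(f"unknown question type: {question!r}")
    if question == "descriptive":
        return ["case reports / surveillance", "cross-sectional"]

    candidates = ["cross-sectional", "case-control",
                  "prospective cohort", "retrospective cohort"]
    excluded = set()
    if question == "causal":
        # A snapshot cannot establish temporal sequence.
        excluded.add("cross-sectional")
    if rare_outcome:
        # Too few cases would turn up in a survey or a feasible cohort.
        excluded.update({"cross-sectional", "prospective cohort"})
    if need_answers_fast:
        # Prospective follow-up takes years.
        excluded.add("prospective cohort")
    if not records_exist:
        # A retrospective cohort depends on existing records.
        excluded.add("retrospective cohort")
    return [design for design in candidates if design not in excluded]


print(suggest_designs("causal", rare_outcome=True,
                      need_answers_fast=True, records_exist=False))
```

The function often returns more than one design, which matches the point just made: the factors narrow the field without always settling on a single answer.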

SCENARIO 1: CHOLERA RISK FACTORS IN A CONFLICT ZONE

An active cholera outbreak is unfolding in Yemen during ongoing conflict, with cases appearing daily and public health teams needing answers within weeks. The outcome is rare even in outbreak conditions, the timeline is compressed, resources are extremely limited, and the question is concrete: what specific exposures are driving transmission?

A case-control study is the realistic choice (Dureab et al., 2019). Assembling a prospective cohort takes too long, and a cross-sectional survey is impractical during active conflict. But cases are already presenting to treatment centers, and they can be matched quickly to neighbors who haven’t gotten sick. Interviewers ask both groups about water sources, food preparation, and hygiene practices, and actionable answers arrive in weeks rather than years. Geographic matching ensures cases and controls faced similar conflict exposure, water infrastructure, and sanitation conditions—controlling for major confounders while keeping data collection feasible.

SCENARIO 2: LONG-TERM HIV TREATMENT OUTCOMES

The question is which patients on antiretroviral therapy are achieving viral suppression and which are failing treatment. The outcome develops over years, and waiting another decade for prospective data would leave today’s patients on suboptimal care in the meantime. Resources are moderate, and clinical records already exist, though they require extraction and cleaning.

A retrospective cohort fits (Muzumbukilwa et al., 2024; Woldegeorgis et al., 2022). Medical records document ART initiation dates, CD4 counts, viral loads, and mortality going back years. Reconstruct the cohort from historical records, trace outcomes to the present, and ten years of follow-up data emerges from a few months of chart review. The trade-off is real: you are limited to whatever clinicians originally recorded, and variables that weren’t documented can’t be studied. In exchange, you gain a decade of time.

SCENARIO 3: POPULATION HEALTH PATTERNS ACROSS REGIONS

The goal is understanding regional variations in breastfeeding practices and maternal health outcomes across Ethiopia. The outcome is common, the question is descriptive rather than causal, and existing survey infrastructure makes a snapshot feasible without launching anything new.

A cross-sectional survey is the right tool (Gebremedhin et al., 2021). There is no need to prove causation—the goal is to characterize patterns clearly enough to target interventions where they will have the greatest impact. A well-designed cross-sectional survey captures regional variations efficiently, and following women over time would cost vastly more without adding value for a descriptive question. Stratified sampling ensures adequate representation from each region, and standardized questionnaires make comparisons across sites possible.

PRACTICAL CONSIDERATIONS THAT MATTER

Beyond these textbook decision factors, real-world constraints often drive design choices. Sometimes your choice is constrained by what data already exist; if health facilities have been collecting electronic records for years, retrospective designs become attractive even when prospective would be theoretically better. Ethical constraints rule out some options entirely: you can’t assign people to go without mosquito nets to see whether nets prevent malaria, because withholding an intervention already known to save lives is unethical. Political and security realities can make prospective follow-up impossible regardless of what your protocol says, and existing infrastructure shifts the calculus too: building on a running surveillance system costs far less than launching something new. Your team’s skills matter as well. Running a cohort study requires long-term participant tracking, database management, and retention strategies; case-control studies require carefully designed control selection procedures. Lacking experience with one approach is a legitimate factor in choosing another.

There’s no single “best” design. The right choice depends on your question, your outcome, your timeline, your resources, and your context. The researchers behind the studies we’ve discussed made pragmatic choices based on real constraints—and still generated evidence that changed practice.

12.7 Reading Observational Studies Critically

The decisions that determine whether an observational study is credible mostly happen before any results appear—in how the comparison group was defined, how exposures were measured, how confounding was handled, and whether loss to follow-up was taken seriously. Reading the study well means looking for those decisions rather than getting swept up in the headline finding.

In case-control studies, start with the controls. Do the authors define a source population, and could a control have been identified as a case if they had developed the disease? Hospital controls for a community-acquired outcome are a warning sign, as are controls drawn for convenience rather than representativeness. If the design is matched, the analysis should say so—look for phrases like “conditional logistic regression” or “matched analysis.” A matched design analyzed with standard logistic regression is a red flag that the analysts didn’t account for the structure they imposed (Cerceo et al., 2009).
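
To see why the matched analysis matters, consider the simplest case: 1:1 matched pairs and a single binary exposure. There the matched odds ratio reduces to the ratio of discordant pairs, which is the estimate conditional logistic regression gives in that special case. The sketch below compares it with an analysis that ignores the pairing. The pair counts are invented for illustration.

```python
def matched_pair_odds_ratio(case_only_exposed: int, control_only_exposed: int) -> float:
    """Matched odds ratio from discordant pairs only."""
    if control_only_exposed == 0:
        raise ValueError("no pairs with only the control exposed; ratio undefined")
    return case_only_exposed / control_only_exposed


def unmatched_odds_ratio(both: int, case_only: int, control_only: int, neither: int) -> float:
    """Odds ratio from the same people with the pairing thrown away."""
    cases_exposed = both + case_only
    cases_unexposed = control_only + neither
    controls_exposed = both + control_only
    controls_unexposed = case_only + neither
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical 100 matched pairs, classified by who in each pair was exposed.
pairs = {"both": 40, "case_only": 30, "control_only": 10, "neither": 20}

matched = matched_pair_odds_ratio(pairs["case_only"], pairs["control_only"])
unmatched = unmatched_odds_ratio(**pairs)
print(f"Matched OR: {matched:.2f}, unmatched OR: {unmatched:.2f}")
```

With these counts the matched estimate is 3.00 and the unmatched estimate about 2.33. Ignoring the pairing pulls the estimate toward the null here, which is the typical direction when the matching factors are related to exposure.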

In cohort studies, loss to follow-up is the first thing to check. What percentage of the cohort was lost, and do the authors compare baseline characteristics of those who completed follow-up against those who didn’t? Substantial loss with no sensitivity analysis should put you on high alert. Differential loss—where people more likely to develop the outcome drop out at different rates by exposure status—can produce findings that look like effects but are really artifacts of who stayed in the study.
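
A short calculation shows how differential loss can manufacture an effect. In the sketch below the exposure truly does nothing: both groups have the same 20% risk. The only thing that varies is who stays in the study. The cohort size, risk, and retention fractions are all invented for illustration.

```python
def observed_risk_ratio(n_per_group: int, true_risk: float, retention: dict) -> float:
    """Risk ratio among those who complete follow-up.

    retention maps (exposed, develops_outcome) to the fraction retained.
    The true risk is identical in both groups, so the true risk ratio is 1.
    """
    def observed_risk(exposed: bool) -> float:
        cases = n_per_group * true_risk * retention[(exposed, True)]
        noncases = n_per_group * (1 - true_risk) * retention[(exposed, False)]
        return cases / (cases + noncases)
    return observed_risk(True) / observed_risk(False)


# Equal 10% loss everywhere: the comparison is unaffected.
even_loss = {(True, True): 0.9, (True, False): 0.9,
             (False, True): 0.9, (False, False): 0.9}

# Exposed people who would go on to develop the outcome drop out more often.
differential_loss = dict(even_loss)
differential_loss[(True, True)] = 0.6

print(f"Even loss: RR = {observed_risk_ratio(1000, 0.2, even_loss):.2f}")
print(f"Differential loss: RR = {observed_risk_ratio(1000, 0.2, differential_loss):.2f}")
```

Even loss leaves the risk ratio at 1.00. Differential loss yields about 0.71, an apparent protective effect produced entirely by dropout. This is why a comparison of baseline characteristics between completers and dropouts, and a sensitivity analysis, are worth looking for.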

In cross-sectional studies, the language tells you whether the authors understand their own design. “Associated with” is honest. “Caused” or “leads to” is not, because cross-sectional data cannot establish temporal sequence. When you see causal language attached to a snapshot, the authors are overreaching.

Across all observational designs, the central question is confounding. What variables did the authors adjust for, and, more importantly, what didn’t they measure? Look for a discussion of unmeasured confounding rather than silence about it. Be wary of long lists of adjustment variables presented without a conceptual framework; such lists often signal fishing rather than thoughtful design. And ask whether the effect sizes make biological sense. Implausibly large effects from observational data more often reflect residual confounding than true causal mechanisms.

A study that survives this kind of scrutiny isn’t necessarily right. But a study that doesn’t is unlikely to be.

12.8 Observational Designs at a Glance

Design	Direction	Key Strength	Key Limitation	Best For
Descriptive	—	Fast, identifies outbreaks	No comparison group	Surveillance, hypothesis generation
Cross-Sectional	—	Prevalence estimation	Can’t establish temporality	Health surveys, needs assessments
Case-Control	Outcome → exposure	Efficient for rare diseases	Recall bias, control selection	Outbreaks, rare outcomes
Cohort (Prospective)	Exposure → outcome	Establishes temporality, calculates incidence	Expensive, loss to follow-up	Multiple outcomes, causal inference
Cohort (Retrospective)	Exposure → outcome	Uses existing records	Limited to recorded variables	Time-sensitive questions, long latency

12.9 Closing Reflection

The physicians who reported those first PCP and KS cases couldn’t randomize. They couldn’t wait. They couldn’t control every confounder. But they could observe, document, and write down what they saw—and those case reports launched a chain of observational research that defined how a new disease behaved, who it affected, and how it spread, long before any treatment trial became possible.

Every observational design in this chapter involves trade-offs. Cross-sectional studies sacrifice temporal clarity for speed. Case-control studies work backward from outcomes, efficient but vulnerable to recall bias. Cohort studies establish the temporal sequence that causal inference demands, but at a cost in time and resources that isn’t always available. None of these designs eliminate confounding. All require assumptions. The difference between a credible observational study and a misleading one is whether the investigators made those assumptions explicit, tested them where possible, and reported the limitations honestly.

In Chapter 7 we introduced the two-bucket framework: confounder control and design-based approaches. The quasi-experimental designs in Chapter 11 exploit exogenous variation. The observational designs in this chapter rely on the other bucket—measuring and adjusting for the things that confound your comparison. Both strategies have limits. Neither is sufficient alone. The best evidence often comes from combining them: a cohort study with careful confounder adjustment, triangulated against a natural experiment that sidesteps the confounding entirely. When two imperfect approaches, with different assumptions and different vulnerabilities, converge on the same answer, that’s when confidence in a causal claim grows.

Alhassan, J. A. K. et al. (2024). Public health surveillance through community health workers: A scoping review of evidence from 25 low-income and middle-income countries. BMJ Open.

Assimon, M. M. (2021). Confounding in observational studies evaluating the safety and effectiveness of medical treatments. Kidney360.

Bezabih, B. A. et al. (2025). Birth pattern seasonality in Ethiopia: Evidence from national demographic and health survey data from 2000 to 2019. Women’s Health Reports (New Rochelle, N.Y.).

Cerceo, E. et al. (2009). Role of matching in case-control studies of antimicrobial resistance. Infection Control and Hospital Epidemiology.

Coates, E. A. et al. (2013). Successful polio eradication in Uttar Pradesh, India: The pivotal contribution of the social mobilization network, an NGO/UNICEF collaboration. Global Health, Science and Practice.

Dureab, F. et al. (2019). Risk factors associated with the recent cholera outbreak in Yemen: A case-control study. Epidemiology and Health.

Friedman-Kien, A. et al. (1981). Kaposi’s sarcoma and Pneumocystis pneumonia among homosexual men—New York City and California. Morbidity and Mortality Weekly Report, 30(25), 305–308.

Gebremedhin, T. et al. (2021). Less than one-fifth of the mothers practised exclusive breastfeeding in the emerging regions of Ethiopia: A multilevel analysis of the 2016 Ethiopian demographic and health survey. BMC Public Health, 21, 1–11.

Gottlieb, M. S. et al. (1981). Pneumocystis pneumonia—Los Angeles. Morbidity and Mortality Weekly Report, 30(21), 250–252.

Haynes, R. B. et al. (2015). Clinical epidemiology: How to do clinical practice research. Wolters Kluwer Health.

Hymes, K. et al. (1981). Kaposi’s sarcoma in homosexual men—a report of eight cases. The Lancet, 318(8247), 598–600.

Jaenisch, T. et al. (2017). Risk of microcephaly after Zika virus infection in Brazil, 2015 to 2016. Bulletin of the World Health Organization.

Jaffe, H. W. et al. (1983). National case-control study of Kaposi’s sarcoma and Pneumocystis carinii pneumonia in homosexual men: Part 1, epidemiologic results. Annals of Internal Medicine, 99(2), 145–151.

Jaffe, H. W. et al. (1985). The acquired immunodeficiency syndrome in a cohort of homosexual men: A six-year follow-up study. Annals of Internal Medicine, 103(2), 210–214.

Lash, T. et al. (2021). Modern epidemiology (Vol. 4). Wolters Kluwer Health.

Muzumbukilwa, T. W. et al. (2024). Evaluation of treatment outcomes among individuals on highly active antiretroviral therapy in KwaZulu-Natal, South Africa. AIDS Research and Treatment.

Siddique, S. et al. (2011). Effects of air pollution on the respiratory health of children: A study in the capital city of India. Air Quality, Atmosphere, & Health.

Swanson, K. C. et al. (2018). Contact tracing performance during the Ebola epidemic in Liberia, 2014-2015. PLOS Neglected Tropical Diseases.

Teixeira, M. G. et al. (2016). The epidemic of Zika virus-related microcephaly in Brazil: Detection, control, etiology, and future scenarios. American Journal of Public Health.

Woldegeorgis, B. Z. et al. (2022). Incidence and predictors of opportunistic infections in adolescents and adults after the initiation of antiretroviral therapy: A 10-year retrospective cohort study in Ethiopia. Frontiers in Public Health.