14 Other Designs and Approaches to Know
The previous chapters covered the workhorses: randomized trials, quasi-experiments, and observational studies. But not every research question fits neatly into those categories. The designs in this chapter share little with each other—different units of analysis, different data, different aims—except that each one arose because the standard toolkit couldn’t answer the question someone needed to ask.
The aim here is breadth, not depth. You don’t need to master any of these to use this chapter. You need to recognize what each is built to do so you can spot one in a paper, or know when to call someone who specializes in it.
14.1 N-of-1 Studies
Here’s a scenario you might face: You’re working with a patient—let’s call her Maria—who has a chronic condition that responds differently to treatments than the “typical” patient described in clinical trials. The trial evidence says Treatment A should work better than Treatment B, but Maria seems to do worse on A. How do you figure out what’s actually best for her?
Or put another way: What if the research question isn’t “What works on average for a population?” but rather “What works for this specific person?”
Enter the N-of-1 study, also called single-case design or single-subject research. Rather than comparing outcomes across many people, you systematically compare conditions within a single individual over time (Lobo et al., 2017).
THE LOGIC OF SINGLE-CASE DESIGN
Think of an N-of-1 study as a carefully choreographed experiment where one person serves as their own control. Instead of having Group A get Treatment 1 and Group B get Treatment 2, you have the same person experience both conditions in a structured sequence.
The classic structure looks like this:
- Baseline phase (A): Measure the outcome repeatedly without intervention
- Intervention phase (B): Introduce the treatment and continue measuring
- Return to baseline (A): Remove the treatment
- Reintroduce intervention (B): Apply the treatment again
This ABA or ABAB design lets you see if the outcome changes when—and only when—the intervention is present (Lobo et al., 2017). If Maria’s symptoms improve every time you introduce Treatment A and worsen when you remove it, you’ve got pretty good evidence that Treatment A is helping her, regardless of what the population-level trials showed.
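To make the replication logic concrete, here is a minimal sketch that summarizes an ABAB series by phase. The symptom scores are invented for illustration (they are not Maria's data or data from any study), and comparing phase means is only one simple way to read such a series; published single-case studies typically also inspect level, trend, and variability visually.

```python
from statistics import mean

# Hypothetical daily symptom scores (lower is better), five per phase.
phases = [
    ("A1", [7, 8, 7, 8, 7]),  # baseline
    ("B1", [5, 4, 4, 3, 4]),  # intervention
    ("A2", [6, 7, 7, 8, 7]),  # return to baseline
    ("B2", [4, 4, 3, 3, 4]),  # intervention reintroduced
]

phase_means = {label: mean(scores) for label, scores in phases}

# Change in mean score at each phase transition (negative = symptoms fell).
labels = [label for label, _ in phases]
transitions = {
    f"{prev}->{nxt}": phase_means[nxt] - phase_means[prev]
    for prev, nxt in zip(labels, labels[1:])
}

# The effect replicates if symptoms fall at both A->B switches
# and rise again when the treatment is withdrawn (B->A).
replicated = (
    transitions["A1->B1"] < 0
    and transitions["B1->A2"] > 0
    and transitions["A2->B2"] < 0
)

print(phase_means)
print(transitions)
print("Effect replicated across phases:", replicated)
```

With these invented numbers the mean drops at both introductions of the treatment and rebounds on withdrawal, which is the pattern the design is built to detect.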
WHAT MAKES THIS RIGOROUS?
You might be thinking: “Wait, isn’t this just…trying different things and seeing what happens?” In a sense, yes. The key difference is the systematic structure.
A well-designed N-of-1 study incorporates several features that strengthen causal inference:
- Multiple measurement points in each phase (not just before-and-after)
- Replication of the intervention effect across phases
- Randomization of when phases occur (when possible)
- Blinding of outcome assessors
When you add these elements together, you’re not just observing, you’re experimenting. Each transition between phases becomes a mini-test of your hypothesis.
Let’s break down how randomization strengthens these designs. In randomized phase-order designs, you don’t always require baseline to precede intervention. Instead, you can randomize the sequence of phases. This reduces threats to validity from temporal factors and history effects. If Maria’s symptoms improve every time she’s on Treatment A regardless of whether A comes first or second in the sequence, you’ve got stronger evidence that the treatment is causing the improvement rather than, say, seasonal changes or the natural course of her disease.
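Randomizing the schedule also gives you a way to test the effect. The sketch below is a randomization test under assumptions that are mine, not the source's: six treatment periods, three of them randomly assigned to Treatment A, one summary score per period, and no carryover between periods. All numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical mean symptom score in each of six periods (lower is better).
period_scores = [7.2, 4.1, 6.8, 3.9, 4.3, 7.0]
# Hypothetical randomized schedule: indices of the periods spent on treatment.
treated_periods = {1, 3, 4}


def mean_difference(treated):
    """Mean score on treatment minus mean score off treatment."""
    on = [s for i, s in enumerate(period_scores) if i in treated]
    off = [s for i, s in enumerate(period_scores) if i not in treated]
    return mean(on) - mean(off)


observed = mean_difference(treated_periods)

# Every schedule the randomization could have produced (3 of 6 periods treated).
all_schedules = [set(c) for c in combinations(range(len(period_scores)), 3)]

# One-sided test: how many schedules give a difference at least as favorable?
as_extreme = sum(mean_difference(s) <= observed for s in all_schedules)
p_value = as_extreme / len(all_schedules)

print(f"Observed difference: {observed:.2f}")
print(f"Randomization p-value: {p_value:.2f} ({as_extreme} of {len(all_schedules)})")
```

Here the observed schedule is the single most favorable of the 20 possible ones, so the one-sided p-value is 1/20 = 0.05. With so few periods that is the smallest value the test can return, which is one reason to plan for more phases than feel strictly necessary.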
Another powerful variation is the multiple baseline design, where you systematically stagger when the intervention starts across different behaviors, individuals, or settings. Imagine you’re working with three patients with similar chronic conditions. You start the intervention with Patient 1 after two weeks, Patient 2 after four weeks, and Patient 3 after six weeks. If all three show improvement precisely when their intervention begins—and only then—you’ve replicated the effect three times at different time points. That’s pretty compelling evidence, even with just three people.
WHEN TO CONSIDER N-OF-1 DESIGNS
N-of-1 designs are most useful when the condition is rare enough that assembling a large trial would take years, when patients show substantial heterogeneity in treatment response and you need to optimize care for a specific individual, or when resources don’t permit a full-scale trial but you still need rigorous evidence. The condition needs to be chronic and stable enough that changes in the outcome likely reflect the intervention rather than natural disease progression—which rules out acute illnesses and rapidly evolving conditions.
REAL-WORLD APPLICATION
A study from Eldoret, Kenya offers a working example. Researchers wanted to test whether a brief, lay-counselor-delivered intervention could reduce harmful drinking and improve family functioning among fathers—a population for whom no proven intervention existed locally and one too dispersed to easily assemble into a conventional trial (Giusto et al., 2020). They turned to the multiple baseline design.
Nine men ages 30 to 48 who screened positive for problem drinking (median AUDIT score 17, in the harmful range) were randomized to staggered start dates for a five-session intervention combining behavioral activation, motivational interviewing, and gender role content aimed at fathers. Each man served as his own control, with weekly alcohol consumption measured by the Timeline Followback method across four weeks of baseline, the intervention period, and four weeks of follow-up.
Eight men completed treatment, and the effects replicated across the staggered starts. The odds of any given day being alcohol-free were 5.1 times higher posttreatment than at baseline (95% CI 3.3-7.9). On the days men did drink after treatment, they drank half as much (95% CI 0.39-0.65). Partners and children (ages 8–17) independently reported improvements in family functioning, and men reported reductions in depression symptoms, suggesting effects that extended beyond drinking itself. When separate baselines all show drops aligned with each man’s own treatment start date, history and maturation become weak alternative explanations. The design produced individual-level evidence in a setting where a full randomized trial would have been impractical.
STRENGTHS AND LIMITATIONS
N-of-1 designs can provide strong causal evidence at the individual level, making them ideal when large samples aren’t feasible and when results need to be directly applicable to the person being studied. They can also be more ethical than assigning many patients to potentially inferior treatments when you only need to optimize care for one.
The trade-offs are real. Results don’t generalize to other individuals without replication. The design requires stable chronic conditions, because highly variable conditions make it impossible to attribute changes to the intervention. Carryover effects between phases can confound results if treatments have lingering effects (build in washout periods). And establishing a stable baseline takes time: starting the intervention before the baseline has settled undermines the entire design.
14.2 Ecological Studies
Here’s a question you can’t answer with individual-level data: Are countries with higher per-capita sugar consumption seeing more type 2 diabetes? You don’t have anyone’s sugar intake on file. What you have is national consumption per capita and national diabetes prevalence—both aggregate. So your unit of analysis becomes the country, not the person.
That’s an ecological study: an analysis where the unit of observation is a group—a country, a city, a clinic, a year—rather than an individual. You give up depth and gain reach. You can compare hundreds of populations using data that’s already collected, but you can’t say what’s happening inside any one person.
WHERE THE DESIGN EARNS ITS KEEP
Ecological work shines when the exposure of interest is most naturally treated as a group-level property. Air quality, policy environments, health system capacity, and climate variables aren’t individual attributes; they’re contextual ones. Asking “Do countries with hotter average temperatures see more vector-borne disease?” is a natural country-level question because climate is a property of places. Policy questions land in similar territory when a national rollout reaches everyone at once, leaving no within-country variation to exploit. When a rollout phases in by region or eligibility threshold, though, you can often do better with individual-level data.
It also earns its keep when individual-level data are unavailable or expensive. Researchers studying cancer mortality across 343 Latin American cities mapped wide between-city variation (overall rates differed by nearly threefold) and linked it to city-level socioeconomic development—work that would be impossible if you required individual medical records (Alfaro et al., 2025). Country-level studies of COVID-19 mortality and ambient temperature, or of pesticide use and cancer incidence by region, follow the same logic. The question is about places, not people.
THE ECOLOGICAL FALLACY
The danger has a name: the ecological fallacy. Group-level and individual-level associations don’t have to agree—they can be weaker, stronger, or point in opposite directions. The mechanism isn’t only confounding (wealthier countries eat more fish and have better cardiac care). Aggregation itself can scramble the signal: a country mean averages over people who differ enormously in exposure and risk, and the relationship across means can look nothing like the relationship across individuals. The classic illustration comes from 1930 U.S. Census data, where states with more foreign-born residents had markedly higher literacy rates—even though foreign-born individuals were only slightly less literate than native-born (Robinson, 1950). Aggregation didn’t just flip the sign; it inflated a near-trivial individual difference into a powerful state-level pattern.
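A toy dataset shows how aggregation can reverse a relationship. The numbers below are invented purely to illustrate the mechanism (they are not Robinson's census data): within the pooled individuals, higher exposure goes with lower outcome, yet the three regional means line up in the opposite direction.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (exposure, outcome) pairs for individuals in three regions.
groups = {
    "Region 1": [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)],
    "Region 2": [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)],
    "Region 3": [(3, 7), (4, 6), (5, 5), (6, 4), (7, 3)],
}

# Individual-level analysis: pool every person.
everyone = [pair for people in groups.values() for pair in people]
individual_r = pearson([x for x, _ in everyone], [y for _, y in everyone])

# Ecological analysis: one data point per region (the regional means).
mean_x = [sum(x for x, _ in people) / len(people) for people in groups.values()]
mean_y = [sum(y for _, y in people) / len(people) for people in groups.values()]
group_r = pearson(mean_x, mean_y)

print(f"Individual-level correlation: {individual_r:+.2f}")
print(f"Group-level correlation:      {group_r:+.2f}")
```

The individual-level correlation is -0.50 while the correlation across regional means is +1.00. An analyst who saw only the three regional averages would draw exactly the wrong conclusion about individuals.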
Good ecological work foregrounds this limit. It frames findings as hypotheses about populations, not claims about individuals.
STRENGTHS AND LIMITATIONS
Ecological designs let you ask questions at the scale exposures actually operate—policy, geography, climate—using data that already exists. They’re inexpensive, fast, and well-suited to global comparisons.
The ecological fallacy is a permanent risk. Confounding is hard to address with aggregate data—you can’t adjust for individual-level covariates you don’t have. Measurement of both exposure and outcome happens at coarser resolution, which can mask within-group variation. Treat ecological findings as a starting point for hypothesis generation and triangulation, not as the last word on causation.
14.3 Modeling Studies: Mathematics Meets Epidemics
When the Zika virus started spreading through Latin America in 2015-16, what did public health officials need to know most urgently? Not just where cases existed today, but where the outbreak was likely to spread next week and next month. Which countries should prepare for surges in cases? How many diagnostic tests would they need?
You can’t run an experiment to answer those questions. You can’t randomize countries to “receive Zika” or “not receive Zika.” And by the time you’ve observed the natural course of events, it’s too late to prepare.
Mathematical modeling is a go-to approach for these situations.
THE BASIC IDEA
Mathematical models use equations to represent how diseases spread through populations. Think of them as simplified versions of reality that capture the essential dynamics while ignoring unnecessary details.
The most common framework is the SEIR model, which divides a population into compartments: Susceptible (people who can catch the disease), Exposed (infected but not yet infectious), Infectious (able to transmit), and Recovered (immune). People flow from one compartment to the next based on parameters like the transmission rate, incubation period, and recovery rate. Change those parameters, and you change how the epidemic unfolds.
The SEIR model is “compartmental” because it divides the population into distinct compartments and tracks flows between them. It’s a bit like a bathtub filling and draining—the water level in each compartment depends on the rates of inflow and outflow.
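The bathtub picture translates almost directly into code. The following is a minimal deterministic SEIR sketch using simple Euler time steps; the parameter values (transmission rate, a 5-day incubation period, a 7-day infectious period, a closed population of 10,000) are hypothetical choices for illustration, not estimates for any real disease.

```python
def simulate_seir(beta, sigma, gamma, population, initial_infectious, days, dt=0.1):
    """Euler integration of a basic SEIR model in a closed population.

    beta:  transmission rate per day
    sigma: rate of leaving the Exposed compartment (1 / incubation period)
    gamma: recovery rate (1 / infectious period)
    """
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    history = [(0.0, s, e, i, r)]
    for step in range(1, round(days / dt) + 1):
        new_exposed = beta * s * i / population * dt   # flow S -> E
        new_infectious = sigma * e * dt                # flow E -> I
        new_recovered = gamma * i * dt                 # flow I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        history.append((step * dt, s, e, i, r))
    return history


history = simulate_seir(
    beta=0.5, sigma=1 / 5, gamma=1 / 7,
    population=10_000, initial_infectious=10, days=200,
)

peak_day, peak_infectious = max(
    ((t, i) for t, _, _, i, _ in history), key=lambda pair: pair[1]
)
print(f"Peak of about {peak_infectious:.0f} infectious people on day {peak_day:.0f}")
```

Every quantity that leaves one compartment enters the next, so the four compartments always sum to the total population. Changing `beta`, `sigma`, or `gamma` and rerunning is the simplest way to see how each parameter reshapes the epidemic curve.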
FROM EQUATIONS TO FORECASTS: THE ZIKA EXAMPLE
When Zika hit, researchers built dynamic neural network models that could predict geographic spread in real time (Akhtar et al., 2019). These weren’t simple SEIR models—they incorporated:
- Weekly case counts from Pan American Health Organization data
- Passenger air travel volumes between countries
- Mosquito habitat suitability for Aedes aegypti
- Socioeconomic indicators like GDP, population density, and healthcare infrastructure
The model took all these data streams, normalized them, and used them to forecast which countries would become high-risk in the coming weeks.
And here’s what’s remarkable: even when predicting twelve weeks ahead—a lifetime in epidemic time—the model achieved over 85% accuracy. That meant officials could plan resource allocation not just for next week, but for the next three months.
DIFFERENT FLAVORS OF MODELS
Not all models are created equal. Here are the main types you’ll encounter:
Deterministic models: The equations fully determine the outcome, with no randomness. If you run the model twice with the same starting conditions, you get identical results. Good for understanding general trends.
Stochastic models: Random variation is built in. Run it twice, get different results, reflecting the reality that epidemics don’t unfold in perfectly predictable ways. Better for capturing uncertainty (Funk et al., 2019).
Agent-based models: Instead of tracking populations, these track individuals. Each “agent” (person or mosquito) has its own characteristics and follows behavioral rules. When thousands of agents interact, population-level patterns emerge (Bomblies, 2014; Smith et al., 2018).
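The difference between deterministic and stochastic models is easiest to see by running one. Below is a sketch of a simple stochastic SIR model (a chain-binomial formulation, chosen here for brevity and not taken from the cited papers) in which each susceptible person faces a daily chance of infection and each infectious person a daily chance of recovery. Parameter values are hypothetical.

```python
import math
import random


def stochastic_sir(beta, gamma, population, initial_infectious, days, seed):
    """Return the number of people ever infected in one simulated outbreak."""
    rng = random.Random(seed)  # separate generator so each run is reproducible
    s = population - initial_infectious
    i = initial_infectious
    for _ in range(days):
        p_infect = 1 - math.exp(-beta * i / population)  # daily risk per susceptible
        p_recover = 1 - math.exp(-gamma)                 # daily chance of recovery
        new_infections = sum(rng.random() < p_infect for _ in range(s))
        new_recoveries = sum(rng.random() < p_recover for _ in range(i))
        s -= new_infections
        i += new_infections - new_recoveries
    return population - s


# Same parameters, different random seeds: each run is one possible epidemic.
final_sizes = [
    stochastic_sir(beta=0.4, gamma=0.2, population=500,
                   initial_infectious=5, days=120, seed=seed)
    for seed in range(1, 6)
]
print("Outbreak sizes across five runs:", final_sizes)
```

Fixing the seed reproduces a run exactly, while varying it produces a spread of outbreak sizes. That spread is the model's statement about uncertainty, which a deterministic model with the same parameters cannot provide.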
THE EBOLA FORECASTING EXPERIENCE
During West Africa’s Ebola epidemic, modelers generated weekly forecasts for specific regions, publishing them online as new data arrived (Funk et al., 2019). These forecasts informed real resource allocation decisions.
The short-term forecasts (1-2 weeks ahead) proved quite reliable, but forecasts beyond three weeks became increasingly unreliable. Why? Because human behavior changes in response to epidemics. People start avoiding handshakes, funerals change, clinics adapt protocols. The models that worked best incorporated these behavioral changes through “dampening mechanisms” that constrained how fast case numbers could grow.
What does “dampening mechanism” actually mean in practice? Think about it this way: a simple SEIR model might predict exponential growth—each infected person infects R others, who each infect R more, and so on. But in reality, as an epidemic grows, people respond. They change their behavior. Clinics improve infection control. Communities modify burial practices. All of these responses slow transmission. Models that incorporated decay terms or spatially heterogeneous contact patterns—recognizing that not everyone mixes with everyone else equally—reduced forecast error by approximately 90 percent.
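A stripped-down sketch shows what a decay term does to a projection. This is not the model used in the Ebola forecasts; it assumes, for illustration only, that weekly cases multiply by a growth factor that starts at a hypothetical value of 1.5 and shrinks exponentially as behavior changes.

```python
import math


def project_cases(initial_cases, r0, weeks, decay=0.0):
    """Project weekly cases when the growth factor decays over time.

    decay=0.0 reproduces plain exponential growth.
    """
    cases = [float(initial_cases)]
    for week in range(weeks):
        growth = r0 * math.exp(-decay * week)  # transmission slows as people respond
        cases.append(cases[-1] * growth)
    return cases


undamped = project_cases(initial_cases=100, r0=1.5, weeks=8)
damped = project_cases(initial_cases=100, r0=1.5, weeks=8, decay=0.1)

for week, (u, d) in enumerate(zip(undamped, damped)):
    print(f"Week {week}: undamped {u:8.0f}   damped {d:8.0f}")
```

After eight weeks the undamped projection reaches roughly 2,560 weekly cases while the damped one is near 160. The two curves are almost indistinguishable in the first week or two, which is why the choice matters far more for long-horizon forecasts than for short ones.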
The practical implication is to use models for near-term operational planning, but don’t expect them to predict the distant future with precision. Update your forecasts constantly as new data arrives. And critically, don’t just run the model with whatever parameters you started with. Calibrate it repeatedly as you learn more about how the epidemic is actually unfolding. The researchers generating Ebola forecasts didn’t just assess whether their predictions were close to reality; they employed probabilistic calibration assessment to evaluate how well they had quantified uncertainty. Getting the point estimate right matters, but so does getting the uncertainty bounds right.
STRENGTHS AND LIMITATIONS
Models can attempt to answer “what if” questions that are hard or impossible to address experimentally, integrate data from multiple sources, and generate forecasts for planning and resource allocation. They can reveal dynamics—tipping points, threshold effects, feedback loops—that aren’t obvious from empirical data alone.
But models are only as good as the assumptions built into them. They require high-quality input data, and uncertainty increases with the forecasting horizon. Critically, model outputs are projections under specific assumptions, not predictions of the future. A model that hasn’t been validated against out-of-sample data may be fitting noise. And models that treat human behavior as static—ignoring that people change their actions during epidemics—will systematically overestimate outbreak size. Always run sensitivity analyses to identify which parameters drive your results and where better data would improve your projections.
14.4 Synthetic Control Method
In 1994, the newly elected ANC government in South Africa began aggressively raising cigarette excise taxes. By 2004, the real excise tax had climbed by 249%, and the average retail price by 110%. Per capita consumption fell. The natural conclusion is that the tax policy caused the decline.
But consumption was already falling before 1994. The economy was contracting, public health awareness was building, and modest regulatory steps were underway (Chelwa et al., 2017). A simple before-and-after comparison would credit the tax for declines that were happening anyway. The question every policy evaluator faces in moments like this: what would have happened without the intervention?
You can’t observe that directly. South Africa either raised taxes or it didn’t. And no single country is a clean comparison—every other low- or middle-income country differs from South Africa on demographics, economy, baseline consumption, regulatory environment. The standard tools, matching and regression adjustment, aren’t built for a setting with one treated unit.
Synthetic control methods are. The idea, introduced by Abadie and Gardeazabal (Abadie et al., 2003) and formalized by Abadie, Diamond, and Hainmueller (Abadie et al., 2010), is to construct a counterfactual by weighting together untreated units so the weighted average closely tracks the treated unit’s pre-intervention trajectory. If it tracks well before the intervention, it provides a credible estimate of what would have happened after.
“Synthetic” here means constructed from real data. The synthetic control is a weighted average of real countries, not output generated by a model. This is different from the “synthetic participants” some researchers now propose simulating with large language models—that’s a separate idea with separate problems.
HOW IT WORKS
The procedure runs in five steps:
- Identify the treated unit and intervention date.
- Assemble a donor pool of untreated units that share enough underlying characteristics to be plausible building blocks.
- Choose matching predictors—variables believed to drive the outcome.
- Solve for weights that minimize the gap between the treated unit and the synthetic control during the pre-period. Weights are constrained to be non-negative and sum to one, which means the synthetic control is a convex combination: an interpolation among real units, not an extrapolation beyond them.
- Project the synthetic control forward. The gap between the actual treated unit and the synthetic control after the intervention is the estimated effect.
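The steps above can be sketched in a few lines. This toy version makes several simplifying assumptions that are mine: it uses three hypothetical donors, matches only on pre-period outcomes (real applications also match on predictors), and finds the weights by brute-force grid search instead of the constrained optimization used in practice. All numbers are invented.

```python
# Hypothetical pre-intervention outcomes (four years) for three donor units.
donors_pre = [
    [100, 98, 97, 95],
    [80, 82, 81, 83],
    [120, 118, 119, 117],
]
treated_pre = [98.0, 97.2, 96.6, 95.8]

# Hypothetical post-intervention outcomes (two years).
donors_post = [[94, 93], [84, 85], [116, 115]]
treated_post = [88.0, 84.0]


def blend(weights, donors):
    """Weighted average of the donor series, one value per time point."""
    return [
        sum(w * series[t] for w, series in zip(weights, donors))
        for t in range(len(donors[0]))
    ]


def pre_period_loss(weights):
    """Sum of squared gaps between treated unit and synthetic control."""
    synthetic = blend(weights, donors_pre)
    return sum((y - s) ** 2 for y, s in zip(treated_pre, synthetic))


# Search weights in steps of 0.01. By construction every candidate is
# non-negative and sums to one, so each is a convex combination of donors.
candidates = (
    (a / 100, b / 100, (100 - a - b) / 100)
    for a in range(101)
    for b in range(101 - a)
)
best_weights = min(candidates, key=pre_period_loss)

# Project forward: the post-period gap is the estimated effect.
synthetic_post = blend(best_weights, donors_post)
gaps = [y - s for y, s in zip(treated_post, synthetic_post)]

print("Weights:", best_weights)
print("Estimated effects by year:", [round(g, 1) for g in gaps])
```

The search recovers weights of 0.5, 0.3, and 0.2, and the post-period gaps come out at about -7.4 and -11.0. Because the weights cannot be negative or exceed one, the synthetic series can never leave the range spanned by the donors, which is the interpolation property described in the fourth step.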
SYNTHETIC SOUTH AFRICA
Chelwa, van Walbeek, and Blecher (2017) built a synthetic South Africa from a donor pool of 24 countries that had not implemented large-scale tobacco control during the study period—Latin American, sub-Saharan African, North African, and South-East Asian nations plus Brazil, India, and China. They matched on the real price of cigarettes, real GDP per capita, per capita alcohol consumption, the proportion of adults in the population, and lagged cigarette consumption.
The optimization assigned positive weights to five countries: Brazil (47.6%), Argentina (27.6%), Chile (14.6%), Tunisia (9.4%), and Romania (0.7%). The rest received zero. During the pre-intervention period from 1980 to 1993, this weighted combination tracked South Africa’s cigarette consumption almost exactly—the average annual difference was about one cigarette per capita against a baseline of roughly 1,000.
After the tax increases took effect, the two series diverged. By 1995, the first full post-policy year, South Africa’s consumption was 38 cigarettes per capita (about 4%) below the synthetic counterfactual. By 2004, the gap had widened to 290 cigarettes—36% below what consumption would have been without the tax increases.
HOW TO TELL IF THE ESTIMATE HOLDS UP
A synthetic control study lives or dies on two questions. Does the synthetic control actually track the treated unit before the intervention? And could the estimated effect be driven by something other than the intervention?
The first is visual. You should be able to see the two series tracking closely before the intervention. If the synthetic control diverges from the treated unit in the pre-period, the post-period gap isn’t a clean treatment effect estimate.
The second comes from sensitivity checks. The clearest is leave-one-out: re-estimate the synthetic control after dropping each donor unit in turn. If the result depends on a single country—say, Brazil, which carried nearly half the weight in the South Africa study—dropping that country should make the estimated effect collapse. It didn’t. The pattern of divergence held across all five leave-one-out specifications, evidence that no single donor was driving the result. The authors also tested an alternative definition of the donor pool, and synthetic South Africa’s trajectory under that specification was similar.
A further check on the second question is the in-space placebo. Apply the same procedure to each donor country, treating it as if it were the treated unit. If the actual treated country’s estimated effect is unusually large compared to the placebo effects, that’s evidence the result isn’t just noise. This kind of permutation logic is how synthetic control studies typically argue for statistical significance, because the method doesn’t yield a confidence interval the way regression does.
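The permutation logic reduces to a ranking exercise. The sketch below assumes you have already run the synthetic control procedure once per unit and saved the pre- and post-period gaps; the gap values and unit names are hypothetical. It summarizes each run by the ratio of post-period to pre-period root mean squared gap, one common way to keep poorly fitting placebos from dominating the comparison.

```python
from math import sqrt


def rmspe(gaps):
    """Root mean squared gap between a unit and its synthetic control."""
    return sqrt(sum(g ** 2 for g in gaps) / len(gaps))


# Hypothetical (pre-period gaps, post-period gaps) from each placebo run.
placebo_runs = {
    "treated": ([1, -1, 1, -1], [-8, -10]),
    "donor_a": ([2, -2, 2, -2], [3, -3]),
    "donor_b": ([1, 1, -1, -1], [2, -2]),
    "donor_c": ([3, -3, 3, -3], [-3, 3]),
    "donor_d": ([2, 2, -2, -2], [5, -5]),
}

# A large ratio means the post-period gap is big relative to pre-period fit.
ratios = {
    unit: rmspe(post) / rmspe(pre) for unit, (pre, post) in placebo_runs.items()
}

# Permutation p-value: share of units whose ratio is at least the treated unit's.
as_extreme = sum(r >= ratios["treated"] for r in ratios.values())
p_value = as_extreme / len(ratios)

for unit, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{unit:8s} ratio = {ratio:.2f}")
print(f"Permutation p-value: {p_value:.2f}")
```

With five units the treated unit ranks first, giving a p-value of 1/5 = 0.20. The number of units caps how small that value can get: with a donor pool of 24 plus the treated unit, the smallest attainable p-value is 1/25 = 0.04.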
WHEN TO USE IT AND WHAT IT CAN’T DO
Synthetic control fits a specific gap in the quasi-experimental toolkit. Use it when one unit—usually a country, region, or state—experienced an intervention and you have a credible donor pool of comparable untreated units along with several years of pre-intervention data. National or sub-national policies are the natural target: tobacco taxes, abortion restrictions, large-scale funding shifts. Ahsan and colleagues (Ahsan et al., 2025) used a variant of the approach to estimate the mortality impact of U.S. global health investments, aggregating 16 recipient countries into a single treated unit and constructing the synthetic control from 19 nonrecipient countries. Their conservative estimate was 1.0 to 1.3 million deaths averted among women of reproductive age between 2009 and 2019.
The method’s deepest limitation is shared by every counterfactual estimator. If something else shifted the treated unit’s trajectory at the same time as the intervention, the estimated effect captures both. Chelwa and colleagues handled this by ending their analysis in 2004, before broader international tobacco-control efforts began affecting many of the donor countries and contaminating the comparison (Chelwa et al., 2017).
Three additional constraints matter. Pre-intervention fit has to be good. Without it, the method isn’t reliable. The donor pool has to be rich enough that a convex combination of donor units can plausibly approximate the treated unit—if the treated unit is too unusual relative to anything available, the method can’t produce a good match. And inference is permutation-based rather than classical, so be cautious of synthetic control studies that report standard p-values without explaining how they arrived at them.
14.5 Economic Evaluations
You’ve identified an intervention that works. The harder question: is it a good use of resources? Health budgets are finite. Money spent on Intervention A is money not available for Intervention B. An intervention might save lives but cost so much that you could save more lives by funding something else.
Economic evaluation is the umbrella for methods that make these trade-offs explicit. A cost analysis asks what an intervention costs to deliver. Cost-effectiveness analysis (CEA) divides costs by a natural health outcome—infections averted, lives saved—to give a cost per unit of effect. Cost-utility analysis (CUA) uses a generic metric like DALYs or QALYs, which lets you compare interventions targeting different diseases. Cost-benefit analysis (CBA) monetizes everything, including health benefits, so you can compare across sectors (Moreland et al., 2019).
In global health, CEA and CUA do most of the work, and the line between them blurs in practice—many studies report cost per DALY averted and call themselves CEA. The summary metric is the incremental cost-effectiveness ratio (ICER): the difference in cost between two options divided by the difference in effect.
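The ICER is simple enough to compute by hand, but writing it out makes the "incremental" part explicit. The costs and DALYs in this sketch are hypothetical.

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("No difference in effect; the ICER is undefined.")
    return delta_cost / delta_effect


# Hypothetical comparison: a new program versus current practice.
ratio = icer(
    cost_new=150_000, effect_new=60,   # cost in dollars, DALYs averted
    cost_old=90_000, effect_old=40,
)
print(f"ICER: ${ratio:,.0f} per additional DALY averted")
```

The new program costs $60,000 more and averts 20 more DALYs, so the ICER is $3,000 per DALY averted. Note that this differs from the new program's average cost per DALY ($2,500), because the ICER asks only what the extra spending buys.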
We return to economic evaluations in depth in ?sec-policy, where they sit naturally within the translation phase of moving evidence to action. There you’ll find worked examples, threshold discussions, and the ingredients method for getting the cost side right.
14.6 Diagnostic Accuracy Studies
Most clinical decisions start with a question: does this person have the disease or not? A diagnostic test is any procedure that helps answer it: a blood smear for malaria, a chest x-ray for pneumonia, a rapid antibody test for HIV, a PCR assay for SARS-CoV-2. Some are simple and cheap. Others require specialized equipment and trained operators. All of them share the same job: take a measurement on a patient, then use it to decide whether the disease is present or absent.
No test gets this right every time. There are two ways to be wrong. The test can miss real disease—a person who has TB tests negative and walks out untreated. Or it can flag disease that isn’t there—a person without TB tests positive, gets put on a six-month treatment regimen, and is exposed to side effects for nothing. These errors don’t cancel out. They have different consequences, and different tests trade off differently between them.
Diagnostic accuracy is the umbrella term for how often a test gets these answers right. Two numbers dominate the conversation. Sensitivity is the proportion of people with the disease who test positive—the test’s ability to catch real cases. A test with 90% sensitivity identifies 9 out of every 10 truly sick people. Specificity is the proportion of people without the disease who test negative—the test’s ability to correctly clear healthy people. A test with 95% specificity correctly clears 95 out of every 100 healthy people, and falsely alarms on the other 5. Both matter. A test that calls everyone positive has 100% sensitivity and zero practical value. A test that calls everyone negative is perfectly specific and misses every case. Real tests sit somewhere in the middle, and the right trade-off depends on the disease, the population, and what happens after a positive result.
THE TWO-BY-TWO TABLE AND ITS OFFSPRING
Every accuracy metric comes from the same table. Apply the index test (the one you’re evaluating) and a reference standard (the best available measure of true disease status) to the same patients, then cross-tabulate.
| | Disease present | Disease absent |
|---|---|---|
| Index test positive | True positive (a) | False positive (b) |
| Index test negative | False negative (c) | True negative (d) |
From these four cells you can derive every common diagnostic measure:
| Measure | What it asks | Formula |
|---|---|---|
| Sensitivity | Of those with disease, what proportion test positive? | a / (a + c) |
| Specificity | Of those without disease, what proportion test negative? | d / (b + d) |
| Positive predictive value (PPV) | Of those who test positive, what proportion truly have it? | a / (a + b) |
| Negative predictive value (NPV) | Of those who test negative, what proportion are truly disease-free? | d / (c + d) |
| Positive likelihood ratio (LR+) | How much does a positive shift the odds of disease? | sensitivity / (1 − specificity) |
| Negative likelihood ratio (LR−) | How much does a negative shift the odds of disease? | (1 − sensitivity) / specificity |
Sensitivity and specificity are properties of the test in a given population. PPV and NPV depend on prevalence: the same test produces a different PPV in a high-prevalence referral clinic than in a low-prevalence screening setting. To see why, imagine a test that is 98% specific applied in a population where only 1% of those tested actually have the disease—most positives will still be false alarms. Likelihood ratios let you update a pre-test probability into a post-test probability without recomputing predictive values each time prevalence shifts—they’re prevalence-independent properties of the test.
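The formulas in the table, and the effect of prevalence on PPV, can be checked with a short sketch. The cell counts and the 90% sensitivity used below are hypothetical; the 98% specificity and 1% prevalence follow the scenario just described.

```python
def accuracy_measures(a, b, c, d):
    """Diagnostic measures from a 2x2 table (a=TP, b=FP, c=FN, d=TN)."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value for a given prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical study counts: 100 with disease, 1,000 without.
measures = accuracy_measures(a=90, b=20, c=10, d=980)

# Same test, two settings.
ppv_screening = ppv_at_prevalence(0.90, 0.98, prevalence=0.01)
ppv_referral = ppv_at_prevalence(0.90, 0.98, prevalence=0.30)

# The likelihood-ratio route gives the same answer as the PPV formula.
via_lr = post_test_probability(0.01, measures["lr_positive"])

print(f"PPV at 1% prevalence:  {ppv_screening:.1%}")
print(f"PPV at 30% prevalence: {ppv_referral:.1%}")
print(f"Post-test probability via LR+ at 1%: {via_lr:.1%}")
```

At 1% prevalence only about 31% of positives are true cases, while at 30% prevalence the figure rises to about 95%. The likelihood-ratio calculation reproduces the low-prevalence result, which is why a single LR+ can be carried from one setting to another.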
WHEN THE RESULT IS A NUMBER, NOT A VERDICT
Some tests return a binary result by design—a lateral flow strip shows a line or it doesn’t. But many tests produce a continuous value: an antibody titer, a viral load, a CD4 count, a quantitative biomarker. Sensitivity and specificity then depend on where you draw the line.
For a marker where higher values indicate disease, moving the positivity threshold higher raises specificity but costs sensitivity; moving it lower does the reverse. (For markers like a CD4 count, where lower values signal a problem, the directions flip.) Plotting sensitivity against (1 − specificity) across thresholds gives you a receiver operating characteristic (ROC) curve, and the area under the curve (AUC) summarizes overall discrimination across all possible cutoffs. An AUC of 0.5 means the test is no better than a coin flip; 1.0 means perfect separation between cases and non-cases. Most useful tests live between 0.7 and 0.95.
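A small worked example makes the threshold trade-off visible. The biomarker values below are hypothetical, and higher values are assumed to indicate disease. The AUC is computed from its rank interpretation: the probability that a randomly chosen case scores higher than a randomly chosen non-case, counting ties as half.

```python
# Hypothetical biomarker values; higher values are assumed to indicate disease.
cases = [8, 7, 6, 5, 9]
non_cases = [3, 4, 5, 2, 6]


def sensitivity_specificity(threshold):
    """Call the test positive when the value is at or above the threshold."""
    sensitivity = sum(x >= threshold for x in cases) / len(cases)
    specificity = sum(x < threshold for x in non_cases) / len(non_cases)
    return sensitivity, specificity


def auc(case_values, non_case_values):
    """Share of case/non-case pairs ranked correctly (ties count as half)."""
    wins = 0.0
    for c in case_values:
        for n in non_case_values:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_values) * len(non_case_values))


for threshold in (4, 5, 6, 7, 8):
    sens, spec = sensitivity_specificity(threshold)
    print(f"Threshold {threshold}: sensitivity {sens:.1f}, specificity {spec:.1f}")

area = auc(cases, non_cases)
print(f"AUC: {area:.2f}")
```

A threshold of 5 catches every case but clears only 60% of non-cases; a threshold of 7 clears every non-case but catches only 60% of cases. The AUC of 0.92 summarizes the whole curve without telling you which of those points to choose.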
Where you actually set the threshold is a clinical decision, not a statistical one. If missing a case is far worse than calling a false positive—say, screening for an aggressive cancer with effective early treatment—you pick a low threshold to maximize sensitivity. If false positives trigger invasive workup or harmful treatment, you pick a higher threshold. The ROC curve maps the trade-off; the choice of point on the curve depends on what comes after the test.
FROM DISCOVERY TO DEPLOYMENT
A diagnostic test moves through several stages before it lands in a clinic. Skip a stage and you risk discovering, after deployment, that what worked beautifully on stored samples falls apart in real patients.
Analytical validation comes first. Does the assay reliably measure what it claims to measure? You characterize accuracy and precision against known quantities of the target, establish the limit of detection, assess reproducibility across operators and reagent lots, and check whether common interferents throw off the result. This work happens under controlled conditions, often with stored or contrived samples, and it answers a narrow question: does the assay work as laboratory chemistry?
Clinical validation is the harder step. Does a positive test correlate with disease in the people you’d actually use it on? You take the test into the intended-use population—patients presenting with symptoms, screening candidates, whoever the clinical scenario specifies—and compare its results against a reference standard. This is where diagnostic accuracy is estimated.
Clinical utility asks the question that often gets skipped: does using the test improve outcomes patients care about? A test can be analytically perfect, clinically accurate, and still useless if the result doesn’t change what clinicians do, or if the change in management doesn’t help patients. Utility usually requires a different design—often a randomized trial comparing care with and without the test—because accuracy alone tells you nothing about downstream impact.
DESIGNING A DIAGNOSTIC ACCURACY STUDY
Several choices determine whether your accuracy estimates can be trusted.
Choice of reference standard. Your accuracy estimates can only be as good as the standard you’re comparing against. For TB, mycobacterial culture is the conventional reference but misses paucibacillary disease. For depression, the standard is often a structured clinical interview, which is itself imperfect. When the reference is imperfect, an index test that detects cases the reference misses will look falsely inaccurate. Composite reference standards and latent class methods exist for these situations, but each carries its own assumptions.
Single-gate vs. two-gate enrollment. A single-gate (consecutive) design enrolls everyone presenting with the relevant clinical question and applies both tests to all of them. A two-gate (case-control) design enrolls known cases and known non-cases separately, then runs the index test on each group. Two-gate studies are easier to assemble but tend to overestimate accuracy because the spectrum—florid cases on one side, healthy controls on the other—doesn’t reflect the diagnostic dilemma the test will face in practice. Single-gate is the stronger design when you can manage it.
Spectrum of disease and of controls. Sensitivity is usually higher in advanced disease than in early or mild disease, and specificity is usually higher when controls are healthy than when they have related conditions that might confuse the test. A study that enrolls only severe cases and healthy volunteers will overstate performance everywhere else. The right spectrum is the one the test will actually encounter.
Blinding. The person interpreting the index test should not know the reference result, and vice versa. Without blinding you get review bias—the interpreter unconsciously calibrates their reading to match the answer they already expect.
Verification of all participants. Every enrolled patient should receive the reference standard, not only those whose index test was positive. Verification bias (sometimes called workup bias) creeps in when negative index tests are less likely to be sent for confirmation, which inflates apparent sensitivity. When the reference is too invasive to apply universally, the design needs explicit corrections for partial verification.
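A small worked example shows how large the distortion from partial verification can be. The sketch below uses invented numbers and assumes that whether a patient is verified depends only on the index test result. The function name and its parameters are illustrative, not taken from any library.

```python
def apparent_accuracy(n, prevalence, sens, spec, verify_pos, verify_neg):
    """Expected sensitivity and specificity among verified patients only.

    verify_pos and verify_neg are the fractions of index-positive and
    index-negative patients who receive the reference standard.
    """
    diseased = n * prevalence
    healthy = n - diseased
    tp, fn = diseased * sens, diseased * (1 - sens)
    tn, fp = healthy * spec, healthy * (1 - spec)
    # Only verified patients make it into the 2x2 table.
    tp_v, fp_v = tp * verify_pos, fp * verify_pos
    fn_v, tn_v = fn * verify_neg, tn * verify_neg
    return tp_v / (tp_v + fn_v), tn_v / (tn_v + fp_v)


# Hypothetical test: true sensitivity 80%, true specificity 90%, prevalence 10%.
full = apparent_accuracy(1000, 0.10, 0.80, 0.90, verify_pos=1.0, verify_neg=1.0)
partial = apparent_accuracy(1000, 0.10, 0.80, 0.90, verify_pos=1.0, verify_neg=0.10)

print(f"full verification:    sens={full[0]:.3f}, spec={full[1]:.3f}")        # 0.800, 0.900
print(f"partial verification: sens={partial[0]:.3f}, spec={partial[1]:.3f}")  # 0.976, 0.474
```

With only one in ten index-negative patients verified, most false negatives never reach the table, so apparent sensitivity climbs from 80% to about 98%. Most true negatives vanish too, so apparent specificity falls from 90% to about 47%.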
Independence of reference and index. If the index test contributes to the reference standard’s adjudication—say, the radiologist diagnosing pneumonia has seen the rapid antigen result—you get incorporation bias, and the index test will look more accurate than it really is.
Sample size. The relevant calculation is the precision you need around sensitivity and specificity, not the power to detect a difference between groups. Sensitivity is estimated only among patients with disease, so the required sample size depends on disease prevalence in the enrolled population. A rare disease forces you to screen many patients to find enough cases for a stable sensitivity estimate.
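As a rough illustration of the precision-based calculation, the sketch below uses the normal-approximation confidence interval for a proportion. This is one common simplification, and exact or score-based intervals would give somewhat different numbers. The expected sensitivity, half-width, and prevalence values are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def cases_needed(expected_sens, half_width, confidence=0.95):
    """Diseased patients needed so the CI around sensitivity is +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * expected_sens * (1 - expected_sens) / half_width**2)


def total_to_enroll(expected_sens, half_width, prevalence, confidence=0.95):
    """Patients to enroll so that, on average, enough of them have the disease."""
    return math.ceil(cases_needed(expected_sens, half_width, confidence) / prevalence)


# Expected sensitivity 90%, target 95% CI of +/- 5 percentage points.
print(cases_needed(0.90, 0.05))  # 139 diseased patients
for prevalence in (0.50, 0.10, 0.02):
    print(prevalence, total_to_enroll(0.90, 0.05, prevalence))
```

The number of cases stays fixed at 139, but total enrollment rises from 278 at 50% prevalence to about 1,390 at 10% and about 6,950 at 2%. The same logic applies to specificity, using non-diseased patients and one minus prevalence.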
The STARD reporting checklist (Bossuyt et al., 2015) is the field standard for what to report. Whether you’re designing your own study or appraising someone else’s, walking through STARD is the fastest way to see whether these choices were handled transparently.
THE TUBERCULOSIS DIAGNOSTICS REVOLUTION
Tuberculosis diagnosis traditionally relied on sputum smear microscopy—cheap and simple, but with only 50-65% sensitivity for culture-positive disease (Lawn, 2015). Culture-based diagnosis, while more sensitive, required 6-8 weeks for results due to slow mycobacterial replication.
The Xpert MTB/RIF assay represented a technological leap: rapid TB and rifampicin-resistance detection in 2 hours. Pooled sensitivity reached 90% for culture-positive cases, with 99% specificity. Critically, sensitivity remained comparable in HIV-positive and HIV-negative patients, despite higher rates of smear-negative disease in HIV-positive populations.
For rifampicin resistance specifically, pooled sensitivity hit 95%—but specificity was only 98% (Lawn, 2015). That 2% false-positive rate matters enormously when prevalence is low, because it means many positive results may be false alarms requiring confirmatory testing.
Positive predictive value depends on both test accuracy and disease prevalence. A test with 98% specificity sounds great, but in a population where only 5% have drug-resistant TB, a test with 95% sensitivity and 98% specificity has a positive predictive value of about 71%: roughly three in ten positive results are false positives. This is why confirmatory testing matters.
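The arithmetic behind that claim follows directly from Bayes’ rule. The short sketch below plugs in the rifampicin-resistance figures quoted above (95% sensitivity, 98% specificity); the 1% and 20% prevalence values are extra illustrative inputs, not figures from the source.

```python
def ppv(sensitivity, specificity, prevalence):
    """Probability that a positive result is a true positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


for prevalence in (0.01, 0.05, 0.20):
    value = ppv(sensitivity=0.95, specificity=0.98, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {value:.3f}")
# prevalence 1%: PPV = 0.324
# prevalence 5%: PPV = 0.714
# prevalence 20%: PPV = 0.922
```

The test itself is unchanged across the three lines. Only the population changes, and with it the share of positive results that can be trusted without confirmation.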
STRENGTHS AND LIMITATIONS
Diagnostic accuracy studies run in the intended-use population evaluate real-world performance rather than ideal conditions and can surface implementation barriers before wide deployment. When they also follow patients beyond the test result, they can begin to assess impact on patient outcomes, although that question usually needs a clinical utility design. Excellent laboratory sensitivity and specificity don’t automatically translate into population impact: supply chains, equipment maintenance, and care-seeking behavior all shape whether an accurate test actually saves lives.
The design requires access to a reference standard, which may not exist for some conditions. Performance varies across settings and operators in ways that laboratory validation can’t predict. Tests validated in patients with obvious disease may perform worse in early or mild cases (spectrum bias). And a positive result is only useful if it reaches a clinician, gets acted upon, and changes patient outcomes—the full diagnostic-to-treatment cascade matters as much as the sensitivity and specificity numbers.
14.7 Delphi Studies: Building Consensus When Evidence Runs Out
Imagine you’re tasked with building a set of indicators to monitor pandemic preparedness in low-resource settings. You can search the literature and review the evidence, but ultimately it’s a judgment call. You need experts to weigh in: people who’ve spent decades on outbreak response and who are scattered across continents. How do you turn that distributed expertise into consensus?
The Delphi method is one rigorous way. It’s a structured process for eliciting expert judgment and progressively narrowing toward consensus, used heavily in global health for setting research priorities, developing indicators, agreeing on case definitions, and building clinical guidelines when empirical evidence is sparse.
THE BASIC STRUCTURE
A Delphi runs in rounds. Round one asks experts a question—open-ended or rating-based—independently and anonymously. The research team aggregates the responses. Round two shows each expert the group’s aggregated answer and asks them to rate or rank again, with the option to revise. The process iterates over two to four rounds until ratings stabilize or a predefined consensus threshold is met.
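What a “predefined consensus threshold” looks like in practice varies by study. The sketch below assumes one possible rule: panelists rate an item on a 9-point scale, and consensus means at least 70% of ratings fall between 7 and 9. Both the scale and the cutoff are assumptions chosen for illustration, and the ratings are invented.

```python
from statistics import median


def agreement(ratings, low=7, high=9):
    """Share of panelists whose rating falls in the agreement band."""
    return sum(low <= r <= high for r in ratings) / len(ratings)


def round_summary(ratings, threshold=0.70):
    """Aggregate feedback for one item in one round."""
    share = agreement(ratings)
    return {
        "median": median(ratings),
        "agreement": share,
        "consensus": share >= threshold,
    }


# Invented ratings from a 10-person panel for a single candidate indicator.
round1 = [9, 8, 7, 4, 5, 8, 3, 9, 6, 7]
round2 = [9, 8, 7, 6, 7, 8, 5, 9, 7, 7]  # after panelists saw the round-1 summary

print(round_summary(round1))  # agreement 0.6, consensus False
print(round_summary(round2))  # agreement 0.8, consensus True
```

The summary dictionary is the kind of controlled feedback that goes back to panelists between rounds. Fixing the rule before round one is what keeps the stopping decision from being made after seeing the answers.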
Two features make this more than a fancy survey. Anonymity blunts the deference, hierarchy, and personality dynamics that derail in-person consensus meetings. Controlled feedback lets participants update on the group’s view each round without being pressured to capitulate face-to-face.
WHERE IT FITS
Delphi is a strong choice when an empirical answer doesn’t exist and won’t soon, when the judgment you need is genuinely the aggregate of expert opinion, when stakeholders span geographies, or when a decision needs the visible legitimacy of “we asked thirty experts in a transparent process” rather than “we asked three people we know.” Priority-setting in global health, the CHNRI methodology used by the Journal of Global Health for child-health research priorities, and indicator development for the Sustainable Development Goals all use Delphi-family approaches at their core.
STRENGTHS AND LIMITATIONS
Delphi turns scattered expert judgment into a documented process. It’s relatively cheap, scales geographically, and produces an audit trail of how consensus was reached.
It also embeds whatever biases the expert panel brings. A panel that’s, say, 80% high-income-country researchers will produce a consensus that reflects high-income-country priorities, no matter how rigorous the process. Convergence isn’t the same as correctness. A Delphi can converge on a wrong answer if every expert shares the same blind spot. And the choice of who counts as an expert is itself a judgment call that shapes the result. Report panel composition and recruitment logic with the same transparency you’d give a study’s eligibility criteria.
14.8 Closing Reflection
Three themes run across these designs. First, rigor and feasibility aren’t opponents. N-of-1 designs provide causal evidence from a single patient. Synthetic controls construct counterfactuals from data when no comparison group exists. Delphi studies build consensus when evidence runs out. Each arose from a practical constraint that the standard toolkit couldn’t handle.
Second, accuracy or efficacy alone doesn’t deliver impact. A test with excellent sensitivity and specificity is wasted if cartridges expire before reaching patients, or if positive results don’t reach a clinician who acts on them. Economic evaluations that ignore delivery costs, and diagnostic studies that stop at sensitivity and specificity, share the same blind spot.
Third, these designs don’t exist in isolation. Modeling studies require observational data to parameterize equations. Synthetic controls borrow logic from time-series and matching methods. Delphi processes often feed into priority-setting for designs of every kind. The best research often combines approaches, using each design’s strengths to compensate for another’s limitations.
Most researchers become expert in one or two of these designs while maintaining enough working knowledge of the others to collaborate effectively or evaluate published work. The goal is to recognize when a question calls for a specialized tool, and to know enough to reach for it or call in someone who can.