It is important to be mindful of how we apply statistical methods when evaluating safety data, because key misapplications of statistics occur in both clinical development and the post-marketing space.
For example, during clinical development we frequently compare adverse event (AE) frequency rates and try to discern when the rate in a treatment group is substantially higher than that in a comparator group. We can run a statistical analysis, come up with a p value, and if it’s <0.05 we might be tempted to conclude that the difference is statistically significant. We might further conclude that this significant difference is due to the study drug. But this approach fails to appreciate that when we apply statistical methods to safety data, we are conducting an exploratory analysis rather than a confirmatory exercise. The trial was not designed with multiple safety endpoints in mind, so the formal use of statistical comparisons is not justified the way it is for an efficacy endpoint.
In the usual use of statistical tests, we have a hypothesis, then design a study to test that hypothesis. We apply our statistical test to that single question, and we either get a statistically significant answer or not. In this setting the statistical test is a confirmatory exercise; we wish to confirm whether our hypothesis is correct. But in drug safety we don’t start with a hypothesis about the frequency of adverse events, nor have we designed our clinical trial to test a safety hypothesis; we reserve formal statistical testing for the efficacy hypothesis. So we explore the safety data instead. We can use statistical tools to do so, but we are much more limited in what we can conclude. Thus, we use statistical tools in drug safety to conduct exploratory analyses rather than confirmatory analyses. If we forget that, we are tempted to draw conclusions that appear to “confirm” our findings. Therein lies our error.
Why is this important? Because if we walk away believing (based on a p value <0.05) that there is a statistically significant difference in AE frequency rates between the treatment groups, we might wrongly conclude that we have sufficient evidence to demonstrate that the difference is due to the study drug. A p value <0.05 actually tells us that, if there were no true difference, a difference at least this large would arise by chance less than 5% of the time. So if we run this statistical analysis across 100 AE frequency rates, we should expect about five “significant” differences that are entirely due to chance alone. And we may fail to look for the confounding factors that produced the difference in the first place.
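A quick simulation makes the multiplicity problem concrete. The sketch below uses assumed sample sizes and a single assumed true AE rate, identical in both arms; it compares 100 AE terms and counts how many come out “significant” purely by chance:

```python
# Simulate 100 AE term comparisons where the true rate is IDENTICAL in both
# arms; several will still cross p < 0.05 by chance alone.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(seed=1)
n_per_arm = 500      # assumed subjects per treatment arm
true_rate = 0.05     # assumed AE background rate, same in both arms
n_terms = 100        # number of AE preferred terms compared

false_positives = 0
for _ in range(n_terms):
    treated = rng.binomial(n_per_arm, true_rate)
    control = rng.binomial(n_per_arm, true_rate)
    table = [[treated, n_per_arm - treated],
             [control, n_per_arm - control]]
    _, p = fisher_exact(table)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' at p<0.05 with no true effect: {false_positives}/{n_terms}")
```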
AE data can easily become confounded. We randomize patients before the initiation of study treatment to start from a level playing field, but new intercurrent illnesses may subsequently develop, or patients may receive other medications which exhibit their own AEs, all in a nonrandomized manner. One or more of these post-randomization events may explain this apparently significant difference in AE frequency rates between treatment groups. It is important to look closely for confounding factors in all cases where a significant difference in frequency rates is observed.
A similar situation arises with the use of disproportionality algorithms (e.g., proportional reporting ratio [PRR], reporting odds ratio [ROR], empirical Bayes geometric mean [EBGM]) to evaluate post-marketing data sets. These tools produce a statistical output that is used to define signals of disproportionate reporting (SDRs): when an adverse event’s score exceeds a given threshold, we have an SDR. All too commonly, though, we misinterpret this statistic as evidence of a product risk. Matters are further complicated by unfortunate nomenclature: the “signal” in “signal of disproportionate reporting” is not the same “signal” as in a safety signal of interest that warrants further evaluation. The score that the algorithm generates has no inherent clinical meaning.
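To make the mechanics concrete, here is a minimal sketch of two of these statistics computed from the standard 2×2 table of report counts. The counts are placeholders, and the threshold screen (score above 2 with at least five cases) echoes the example below; EBGM, which requires empirical Bayes shrinkage across the whole database, is not shown:

```python
# Disproportionality statistics from the standard 2x2 report table.
# a: reports with drug of interest AND event of interest
# b: reports with drug of interest, all other events
# c: reports with all other drugs AND event of interest
# d: reports with all other drugs, all other events
def prr(a, b, c, d):
    """Proportional reporting ratio: (a/(a+b)) / (c/(c+d))."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: (a/b) / (c/d) = (a*d) / (b*c)."""
    return (a * d) / (b * c)

a, b, c, d = 12, 988, 200, 98800          # placeholder counts
is_sdr = prr(a, b, c, d) > 2 and a >= 5   # a screening flag, not a product risk
print(f"PRR={prr(a, b, c, d):.1f}  ROR={ror(a, b, c, d):.1f}  SDR={is_sdr}")
```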
Indeed, high SDR scores can also be subject to confounding. Consider the following examples.
We apply a frequentist method (PRR) and a Bayesian method (EBGM) to an antibiotic used to treat various soft tissue infections. The algorithms output a list of AE preferred terms with PRR and EBGM05 scores greater than the threshold of 2 with at least five cases present in the data set. The term “otitis media” appears with a PRR of 63 and an EBGM05 of 83, both scores being substantially greater than the threshold of 2. But this is an example of confounding by indication. In the past, it was not uncommon for the indication of drug use to be coded to MedDRA along with the AEs. Thus, a drug used to treat otitis media will have a disproportionately higher rate of otitis media cases simply due to the indication for use.
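As a back-of-the-envelope illustration, counts invented to reproduce a PRR near 63 show how indication coding alone can inflate the statistic:

```python
# Hypothetical counts illustrating confounding by indication: "otitis media"
# coded on reports for a drug indicated to treat otitis media.
a, b = 40, 160      # otitis media / other events, drug of interest
c, d = 50, 15750    # otitis media / other events, all other drugs
prr = (a / (a + b)) / (c / (c + d))
print(f"PRR = {prr:.0f}")  # ~63, driven by indication coding, not a drug effect
```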
The same analysis found a PRR of 55 and an EBGM05 of 67 for hirsutism, based on eight cases. But closer inspection of the case characteristics (e.g., age, gender, date of onset, concomitant medications) suggested that five of them were duplicate reports. Once the duplicates were removed, the case count fell below the numerical threshold of five.
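A first-pass screen for duplicates can be as simple as collapsing reports that match on key case characteristics. A minimal sketch, with hypothetical column names and invented records:

```python
# Flag potential duplicate reports by matching on case characteristics
# (hypothetical fields) before counting cases against a threshold.
import pandas as pd

reports = pd.DataFrame({
    "age":        [54, 54, 61, 54, 47, 54, 61, 54],
    "sex":        ["F", "F", "M", "F", "F", "F", "M", "F"],
    "onset_date": ["2023-04-02", "2023-04-02", "2023-05-11", "2023-04-02",
                   "2023-06-20", "2023-04-02", "2023-05-11", "2023-04-02"],
    "con_meds":   ["probenecid", "probenecid", "none", "probenecid",
                   "none", "probenecid", "none", "probenecid"],
})

match_fields = ["age", "sex", "onset_date", "con_meds"]
distinct = reports.drop_duplicates(subset=match_fields)
print(f"{len(reports)} reports -> {len(distinct)} distinct cases")
```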
Cases can also be confounded by the “bystander” effect. This analysis found that the term “dizziness” had a PRR of 75 and EBGM05 of 92. It was not uncommon for this antibiotic to be coprescribed with probenecid to maintain higher serum concentrations. Probenecid is associated with its own list of possible adverse reactions, including dizziness. Further investigation of these “dizziness” cases found that probenecid was listed as a concomitant medication in all instances.
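Checking for a bystander is largely a matter of tabulating concomitant medications across the signal cases. A minimal sketch with invented records:

```python
# Tabulate concomitant medications across the cases behind a signal of
# disproportionate reporting (hypothetical records).
from collections import Counter

dizziness_cases = [
    {"case_id": 101, "con_meds": ["probenecid"]},
    {"case_id": 102, "con_meds": ["probenecid", "ibuprofen"]},
    {"case_id": 103, "con_meds": ["probenecid"]},
    {"case_id": 104, "con_meds": ["probenecid"]},
]

counts = Counter(med for case in dizziness_cases for med in case["con_meds"])
n = len(dizziness_cases)
for med, k in counts.most_common():
    print(f"{med}: listed in {k} of {n} cases")
```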
What should drug safety professionals do instead? First, avoid the temptation to treat safety analyses like efficacy analyses. Instead of using p values in our evaluation of AE frequency differences, we can use other statistical tools, such as the risk difference and the risk ratio, to focus our attention on the events with the greatest between-group differences. Then, starting with the events showing the greatest difference in frequency rates, consider confounding factors as part of a thorough evaluation.
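For instance, a sketch (with invented counts) of ranking AE terms by risk difference rather than testing each one for significance:

```python
# Rank AE terms by risk difference (treated rate minus control rate)
# instead of running a significance test on each term.
def risk_metrics(events_t, n_t, events_c, n_c):
    rate_t, rate_c = events_t / n_t, events_c / n_c
    risk_difference = rate_t - rate_c
    risk_ratio = rate_t / rate_c if rate_c > 0 else float("inf")
    return risk_difference, risk_ratio

# term: (treated events, treated N, control events, control N) -- invented
ae_counts = {
    "headache": (30, 500, 28, 500),
    "nausea":   (45, 500, 20, 500),
    "rash":     (12, 500, 3, 500),
}

ranked = sorted(ae_counts.items(),
                key=lambda item: risk_metrics(*item[1])[0], reverse=True)
for term, counts in ranked:
    rd, rr = risk_metrics(*counts)
    print(f"{term:10s} risk difference {rd:+.3f}   risk ratio {rr:.2f}")
```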
In the case of post-marketing disproportionality analyses, first realize that an SDR score above the threshold does not automatically make that event a safety signal requiring a thorough evaluation. Always consider the possible contribution of various confounding factors, including confounding by indication, the number of duplicate cases, and the bystander effect, among others. It remains up to drug safety professionals to decide which signals of disproportionate reporting represent clinically relevant safety signals that warrant further exploration.