Special Section
Solving the Rubik’s Cube of Competing Forces to Realize the Future of Clinical Research
Part 1: AI and Data Quality in a Decentralized World
Steve Young

Artificial intelligence (AI) has massive potential to advance clinical research. A scoping review in 2022 found that use of AI in the design and conduct phases of clinical trials has a positive impact on efficacy, safety and cost containment. From accelerating recruitment and enhancing patient selection, to identifying data quality issues through enhanced statistical analysis, machine learning (ML) and deep learning (DL) approaches are unlocking new possibilities.

But any algorithm is only as good as the data it is built upon. Hybrid and decentralized clinical trials (DCTs) are increasingly popular because they can increase access to diverse participant populations while also boosting patient centricity and retention. However, their reliance on patient-reported data can introduce risk and erode data quality.

To fully realize the future of clinical research, we need to adopt AI-driven risk-based quality management (RBQM). It is the best way to ensure and enhance the accuracy of all data, whether the source be wearables, apps, home healthcare, telehealth visits or traditional site methods.

The Tension

Hybrid and decentralized trials are on the increase. Some experts estimate that DCT volume will rise by 17 percent by the end of this year, surpassing the peak activity observed in 2021. The benefits of remote trials have been covered at length. The FDA says DCTs can reduce patient and sponsor burden and increase accrual and retention of a more diverse trial population.

However, this changing trial landscape is creating new challenges for data analysts. Decentralized and hybrid trials are based on a remote data collection model supported by wearables, patient-reported outcomes (PROs) and home health. The amount of data collected by these methods is significant and looks set to rise even further before the end of the decade. Half of clinical trials are predicted to incorporate wearables or sensors by 2025.

More methods of data collection mean increased data volume, both in terms of sources and data points. Remote models can also increase the risk that data will be falsified or less reliable. As the data environment becomes increasingly complex, traditional analysis methods are struggling to keep up. Only 1.1 percent of site-entered data is corrected as a result of 100 percent source data verification (SDV), which already suggests that SDV has a negligible effect on overall data quality. The continued move towards direct patient data capture only further reduces the relevance of SDV and other data review methods.

AI has emerged as a possible way to improve data analysis. ML and DL can find patterns in huge datasets that would remain invisible to the human eye, streamline processes and help guide skilled staff with data that has already been reviewed and analyzed.

However, any AI model is only as good as the data it holds. Poor-quality input equals poor-quality output. Almost half (44 percent) of sponsors and contract research organizations (CROs) say data collection and reporting is a key challenge when introducing DCTs. Clinical trials previously relied on highly qualified healthcare professionals (HCPs) to collect data. Now the burden of data collection is increasingly moving to the patient, and questions have been raised about how effectively that data can be verified.

If we are to effectively implement this new clinical research paradigm, we need to find a way to balance the competing forces of patient centricity, efficiency and data quality.

The Solution

Risk-based monitoring focuses on what matters most to the clinical trial. The approach has been backed by the FDA in various official documents, including the guidance "A Risk-Based Approach to Monitoring of Clinical Investigations: Questions and Answers", issued earlier in 2023.

RBQM applies this principle to the entirety of a clinical trial, allowing sponsors to manage data quality continuously. Advanced statistical methods have been highly effective at powering centralized monitoring of data from sites in near real-time. Statistical monitoring works on the principle that data from all sites should, aside from the random play of chance and systemic variations, be broadly similar.
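As a rough illustration of that principle, the sketch below compares each site's mean for a single measurement against the other sites using a robust z-score (median and median absolute deviation). It is a deliberately simplified stand-in, not the statistical tests used in real RBQM platforms. The function name, the 3.5 threshold and the blood-pressure readings are all hypothetical.

```python
from statistics import mean, median

def flag_atypical_sites(values_by_site, threshold=3.5):
    """Return {site: robust z-score} for sites whose mean is far from the others.

    Hypothetical helper: a robust z-score on site means, used here only to
    illustrate the "sites should look broadly similar" idea.
    """
    site_means = {site: mean(vals) for site, vals in values_by_site.items()}
    centre = median(site_means.values())
    # Median absolute deviation of the site means around their median.
    mad = median(abs(m - centre) for m in site_means.values())
    if mad == 0:
        return {}  # no spread between sites, so nothing can be scored
    scores = {s: 0.6745 * (m - centre) / mad for s, m in site_means.items()}
    return {s: z for s, z in scores.items() if abs(z) > threshold}

# Synthetic systolic blood pressure readings (mmHg), invented for this example.
readings = {
    "A": [128, 135, 122, 140],
    "B": [131, 126, 138, 129],
    "C": [125, 133, 137, 130],
    "D": [134, 127, 129, 136],
    "E": [120, 120, 121, 120],  # unusually low and unusually uniform
}

print(flag_atypical_sites(readings))  # only site E is flagged
```

A flagged site is a prompt for investigation, not a verdict: the same signal could reflect fraud, a miscalibrated device or a genuinely different patient population.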

ML and DL algorithms are now being used on top of these established methods to further improve the ability to identify unusual data patterns that represent emerging issues, such as fraud, miscalibrated equipment, or training issues at the site, region or study level. This allows researchers to investigate and take any necessary corrective action before data quality issues can affect results.

For example, data from more than 1,200 clinical trials are being used to develop a DL risk analysis optimization model that flags signals which are more likely to represent a real issue. This model is expected to be available within the next year, and will help prioritize signal review and ensure effective follow-up and documentation of findings.

AI can also improve efficiencies in other areas. For example, mapping adverse events and concomitant medications recorded in case report forms to MedDRA or WHO Drug dictionaries is traditionally a manual, time-consuming task. A medical coding DL model can guide researchers to the correct corresponding term in mere seconds with greater than 95 percent accuracy.
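To make the coding task concrete, the sketch below suggests dictionary terms for a verbatim entry using plain string similarity from Python's standard library. This is not a DL model and will not approach the accuracy quoted above; it only shows the shape of the problem (free text in, ranked candidate terms out). The tiny term list is a hypothetical placeholder, not real MedDRA or WHO Drug content, and the function name is an assumption.

```python
from difflib import get_close_matches

# Hypothetical toy dictionary; a real coding dictionary is licensed and far larger.
TOY_DICTIONARY = ["Headache", "Nausea", "Dizziness", "Myocardial infarction", "Rash"]

def suggest_terms(verbatim, dictionary=TOY_DICTIONARY, n=3, cutoff=0.6):
    """Return up to n dictionary terms that resemble the verbatim text."""
    # Compare in lower case, but hand back the dictionary's own spelling.
    lookup = {term.lower(): term for term in dictionary}
    matches = get_close_matches(verbatim.strip().lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[m] for m in matches]

print(suggest_terms("headach"))
print(suggest_terms("nausia"))
```

In either approach the output is a suggestion for a trained coder to confirm or reject, which is the human-in-the-loop pattern described here.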

These are just two examples which demonstrate how AI-driven RBQM approaches are not about replacing skilled staff but giving them the tools they need to do their job more efficiently and effectively on a day-to-day basis. This will help preserve data integrity, allow rapid corrective action on identified issues, and offer opportunities to gain new insights and improve outcomes for all stakeholders.

Case Study

Reanalysis of the database of a large randomized controlled trial (RCT) known to have been affected by fraud found that RBQM could have uncovered the problem approximately a year earlier than traditional, retrospective methods were able to.

Conducted in the 1990s, the Second European Stroke Prevention Study (ESPS2) included 7,000 patients across 60 sites in 13 countries. Severe inconsistencies in the case report forms (CRFs) of one site, #2013, led the trial’s steering committee to question the data’s reliability.

A for-cause analysis of quality control samples and extensive additional analyses, including blood concentrations of the investigational drugs, showed that patients had never received the protocol medications.

The site’s data covering 438 patients was excluded from the trial and the investigator was convicted of fraud. In total, this investigation took around a year to complete.

Researchers undertaking reanalysis applied mixed-effects statistical monitoring tests across the completed study database to identify unusual patterns at sites and regions, or among patient groups. Site #2013 was identified as the second most atypical site.

They then applied the tests on data that represented incrementally earlier time points in the execution of the study to replicate the effect of ongoing RBQM. They found site #2013 would have been detected as atypical when 25 percent of the site’s overall patient data volume had accrued in May 1991. Traditional methods were only able to detect a potential problem at 75 percent of data collection in June 1992 and confirmed fraud in January 1993.
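The replay idea can be sketched as follows: rerun a site-comparison test on successively larger snapshots of the data and record the first cutoff at which each site is flagged. The reanalysis used mixed-effects statistical tests; the robust z-score here is a much simpler substitute chosen to keep the example short. All names, the threshold and the data are hypothetical and synthetic, and bear no relation to the ESPS2 database.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median

def atypical_sites(records, threshold=3.5):
    """Flag sites whose mean value is far from the other sites (robust z-score)."""
    by_site = defaultdict(list)
    for _, site, value in records:
        by_site[site].append(value)
    if len(by_site) < 3:
        return set()  # too few sites to compare
    means = {s: mean(v) for s, v in by_site.items()}
    centre = median(means.values())
    mad = median(abs(m - centre) for m in means.values())
    if mad == 0:
        return set()
    return {s for s, m in means.items() if abs(0.6745 * (m - centre) / mad) > threshold}

def first_detection(records, cutoffs):
    """Replay the test at each cutoff; return {site: first cutoff where flagged}."""
    first_seen = {}
    for cutoff in sorted(cutoffs):
        # Use only the data that would have existed at this point in the study.
        snapshot = [r for r in records if r[0] <= cutoff]
        for site in atypical_sites(snapshot):
            first_seen.setdefault(site, cutoff)
    return first_seen

# Synthetic monthly readings per site (one value per month, January to April).
monthly = {
    "S1": [130, 133, 128, 131],
    "S2": [129, 132, 131, 127],
    "S3": [131, 127, 133, 130],
    "S4": [132, 131, 129, 134],
    "S5": [130, 118, 117, 118],  # drifts away from the other sites after month 1
}
records = [
    (date(2000, month, 15), site, value)
    for site, values in monthly.items()
    for month, value in enumerate(values, start=1)
]
cutoffs = [date(2000, 1, 31), date(2000, 2, 29), date(2000, 3, 31), date(2000, 4, 30)]

print(first_detection(records, cutoffs))
```

With this synthetic data, site S5 is first flagged at the second cutoff, when only half of its data has accrued, which mirrors the kind of early signal the reanalysis describes.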

Site fraud is rare, but this example demonstrates how AI and RBQM can be used to ensure and enhance data quality in a DCT world.


Clinical trials are collecting more data than ever. This has the potential to enhance treatment safety and efficacy as new technologies diversify patient recruitment and data offers robust insights into disease, target populations, standards of care and treatment pathways. This is particularly important for previously underrepresented and excluded groups, which can have distinct disease presentations or health circumstances. However, it also creates new challenges for us as the custodians of that data.

We have a responsibility to make the best possible use of data while ensuring its reliability, no matter what the source. AI-driven RBQM approaches can help us do that and find a balance between the competing forces of patient centricity, data quality and efficiency.