Cancer, Cloud Computing, and a National Research Data Ecosystem for Cancer
Warren A. Kibbe
Chief, Translational Biomedical Informatics, and Chief Data Officer
Duke Cancer Institute, Duke University

Open science and open data initiatives have become something of a cause célèbre, due in part to activities like the European Public Sector Information (PSI) Directive and Plan S, which are striving to make open access publication the norm in the EU. These activities, however, build on many years of work to open access to publicly funded research data, including the Human Genome Project, GenBank, PubMed, PubMed Central, the FDA's Information Exchange and Data Innovation (INFORMED) and Sentinel initiatives, and a host of important projects too numerous to mention. Groups like Force11 have pushed for the open sharing of data, most notably through the FAIR principles for making data Findable, Accessible, Interoperable, and Reusable. Likewise, the private sector, particularly the pharmaceutical industry, has moved more and more activities into shared space. These initiatives include Project Data Sphere and the Yale Open Data Access (YODA) Project for sharing clinical research data, TransCelerate for sharing pre-competitive biopharmaceutical data, and tranSMART for sharing analytical tools and workflows. Organizations like Vivli focus on extracting value from large, shared, and accessible clinical studies.

These initiatives and projects have a variety of drivers: making publicly funded datasets more easily accessible, improving the reusability and interoperability of data, reducing duplicated effort by sharing results and underlying data to speed discovery, and accelerating the rate of innovation by decreasing the time from data generation to data availability. While these drivers may not fully align, they all highlight the importance of data, of data accessibility, and of well-described, well-annotated data for scientific insight, for sharing and analyzing evidence, and for laying a foundation for effective data reuse and reproducibility that increases the velocity and efficiency of scientific and technological innovation.

Cancer Research Data Ecosystem

With this backdrop, what is the role of a national research data ecosystem for cancer? While the initiatives and projects above are relevant and include cancer examples, the opportunity and need to continue to “bend the curve” in cancer research require dedicated programs focused on current problems in the field. One opportunity in oncology that has been, and will continue to be, very important is embracing and advancing precision medicine approaches. A number of National Cancer Institute (NCI) activities have helped shape these opportunities, including The Cancer Genome Atlas (TCGA), started in 2005; Therapeutically Applicable Research to Generate Effective Treatments (TARGET); and the many clinically oriented projects such as NCI Molecular Analysis for Therapy Choice (NCI-MATCH), started in 2015.

These projects map the molecular landscape of cancer and, through NCI-MATCH, provide a national trial that incorporates features of umbrella and basket trials for understanding the complex interplay between genomics and therapy. The Beau Biden Cancer Moonshot, funded through the 21st Century Cures Act, has created opportunities to extend these findings: to understand the role of the tumor microenvironment, integrate traditional and novel imaging methods, incorporate epigenomics and proteomics, and then fold in novel clinical trials and real-world evidence (RWE). The NCI is actively exploring adding RWE approaches to its existing portfolio of clinical trials to augment and extend its capabilities. This requires a very different data infrastructure than a repository focused on a specific technique, project, or technology.

The NCI embarked on two initiatives exploring how to represent and share genomic and limited clinical data from cancer patients, aligned with a precision medicine view of cancer. The first, the Genomic Data Commons (GDC), led by Dr. Robert Grossman at the University of Chicago and spearheaded by Dr. Lou Staudt and the Center for Cancer Genomics at NCI, set out to build a much more extensible and scalable infrastructure for sharing large-scale cancer genomics projects like TCGA and TARGET. The second, the Cloud Pilots, now the Cloud Resources, was designed to fund multiple complementary models for using cloud computing and storage effectively to democratize access to genomic data. Both initiatives are ongoing, and both have helped make cancer genomic data more sustainable, accessible, analyzable, and consistent. The Beau Biden Cancer Moonshot has also been an important contributor to data sharing; projects like the Applied Proteogenomics, Organizational, Learning, and Outcomes (APOLLO) network and the NCI Cancer Research Data Commons (CRDC), as well as the case for a national data ecosystem for cancer, have emerged from the national discussions the Moonshot fostered.

As these efforts have progressed, it has become clear that several “sticky” issues still need attention: building a trained cancer informatics and data science workforce that can contribute to and benefit from these initiatives, and creating scalable processes for ingesting and annotating clinical, imaging, genomic, and other research data into resources like the GDC and the CRDC. We also need institutional, local, and national policies and incentives that reward data sharing, data reuse, and the work required to make basic, translational, and clinical research reusable and reproducible, supporting a FAIR ecosystem.

While it is not easy to quantify the value of reuse and reproducibility, it is clear that scientific innovation, implementation science (moving proven ideas and processes into the real world), and commercialization are all undermined when data are sequestered, are poorly or incompletely annotated, or when the algorithms, tools, or workflows necessary to analyze the data are not available and shared. The impact of irreproducible experiments has been well described and is a significant barrier to turning discoveries into therapies and products that directly benefit patients.

From a data sharing and cloud computing standpoint, it is important to note that sharing data in the pre-competitive space is much easier than in the competitive space. It is also critical that data sharing be designed and implemented so that intellectual property protections can be put in place, since discoveries that require significant capital investment to reach the market, as in drug discovery, depend on such protection to attract that commercial investment. The lasting value for cancer research of an ecosystem that values and rewards open data that are well described, ready for reuse, and connected with algorithms and analytics comes from the ability to derive insight, and to validate methods and knowledge from these data, iteratively and reproducibly.

Data Representation

There are multiple ways to make cancer research and care data FAIR (Findable, Accessible, Interoperable, and Reusable). An important component of interoperability is having the data in a common representation, annotated with known and shared common data elements, and expressed in a common data model. For clinical data there are multiple well-documented, well-constructed, and well-maintained terminologies, including SNOMED, LOINC, RxNorm, MedDRA, and ICD. Each has a domain and a conceptual context for describing, documenting, and reporting clinical information, including patient presentation, lab results, pharmacy data, adverse events, and billing. To augment these terminologies, the National Library of Medicine maintains the Unified Medical Language System (UMLS) to help crosswalk terms among them. For cancer research and care, the NCI maintains the Enterprise Vocabulary Services (EVS) and the NCI Metathesaurus. An important use of these terminologies is enabling the harmonization and interoperability of resources.
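The crosswalking that UMLS enables can be sketched as a lookup through a shared concept identifier (CUI). This is a minimal illustration only; the concept ID and codes below are hypothetical examples, not authoritative UMLS content, and real crosswalks work over the full Metathesaurus.

```python
# Illustrative sketch: crosswalking codes between terminologies via a shared
# concept identifier, in the spirit of the UMLS Metathesaurus.
# The CUI and code values below are hypothetical, for illustration only.

# Each record links a (terminology, code) pair to a shared concept ID (CUI).
CONCEPT_MAP = [
    ("C0000001", "SNOMEDCT", "2092003"),   # hypothetical melanoma concept
    ("C0000001", "ICD10CM",  "C43.9"),
    ("C0000001", "MDR",      "10053571"),
]

def crosswalk(source_vocab, source_code, target_vocab):
    """Return target-vocabulary codes sharing a concept with the source code."""
    cuis = {cui for cui, vocab, code in CONCEPT_MAP
            if vocab == source_vocab and code == source_code}
    return [code for cui, vocab, code in CONCEPT_MAP
            if cui in cuis and vocab == target_vocab]

print(crosswalk("ICD10CM", "C43.9", "SNOMEDCT"))  # -> ['2092003']
```

The design point is that no terminology maps directly to another; each maps into a shared concept space, which is what makes N-way interoperability tractable.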

To encapsulate the data linked to these terms, HL7 provides a number of resources, including Fast Healthcare Interoperability Resources (FHIR). FHIR defines a modern, web-service-focused data exchange protocol, typically carrying well-described data as JavaScript Object Notation (JSON) over RESTful APIs. The Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) provides a data model for encapsulating data submitted to the FDA for regulatory review. The FDA also developed the data model for the Mini-Sentinel and Sentinel projects, which captures patient events and outcomes for phase 4 and post-approval drug surveillance. The National Patient-Centered Clinical Research Network (PCORnet) Common Data Model (CDM), based on the Sentinel data model, serves the health outcomes and surveillance communities. The Observational Health Data Sciences and Informatics (OHDSI) program's Observational Medical Outcomes Partnership (OMOP) CDM has some structural similarities to the PCORnet CDM but is designed to capture care pathways and outcomes. A national research data ecosystem for cancer will require terminologies, ontologies, and data models that support basic science, translational projects, and clinical trials, making the data findable, accessible, interoperable, and reusable (FAIR).
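To make the FHIR-as-JSON idea concrete, here is a minimal sketch of a FHIR Observation resource coded against LOINC. The patient reference and coded values are illustrative placeholders, not a complete or validated FHIR profile.

```python
import json

# A minimal, illustrative FHIR Observation resource expressed as JSON.
# The LOINC code, display text, and patient reference are example values;
# a real exchange would conform to a full FHIR profile and terminology binding.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "21908-9",  # example LOINC code for clinical stage group
            "display": "Stage group.clinical Cancer",
        }]
    },
    "subject": {"reference": "Patient/example"},  # hypothetical patient ID
    "valueCodeableConcept": {"text": "Stage II"},
}

payload = json.dumps(observation, indent=2)
print(payload)
```

Because the payload is plain JSON annotated with standard terminology codes, any FHIR-aware system can parse it and resolve the coded meaning, which is precisely the interoperability property the text describes.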

Computation and Cloud Computing

Another aspect of the opportunities and capabilities in applying technology, informatics, analytics, and data science is the application of advanced analysis algorithms, machine learning techniques, and deep learning to important problems in cancer. These computational methods touch nearly every aspect of cancer research and cancer care. It is important to realize that many of these methods have been around for almost forty years; hidden Markov models and linear regression have long been used in genetics, genomics, and clinical trials, and they rest on a rich theoretical framework. Computational power, the maturity of algorithms, and the availability of digital data have continued to evolve, increasing the utility and applicability of these methods. Deep learning has proven highly effective in the imaging world and promises to free researchers and clinicians from having to become experts in image processing and classification.
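As a reminder of how simple the core of these long-established methods can be, here is a sketch of ordinary least squares for a one-variable linear regression, the kind of model long used in genetics and clinical trials. The data are invented toy values.

```python
# A minimal sketch of simple linear regression (ordinary least squares),
# one of the long-established statistical methods mentioned in the text.
# The data points are invented, perfectly linear toy values: y = 2x + 1.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y ~ slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (2.0, 1.0)
```

What has changed over forty years is not this mathematics but the compute and data available to apply it, and its many-variable and regularized descendants, at scale.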

Two examples from an increasingly rich body of work are the use of convolutional neural networks to identify and classify melanoma, and the identification of metastatic breast cancer from whole-slide image classification. While these examples are incredibly important and show the power and promise of deep learning approaches to real problems in cancer research and cancer care, there are many ways to deliver, compute on, and store data.

The attraction and power of cloud computing is that it abstracts away from the consumer the need to maintain physical IT resources while providing scalable access, scalable computation, and scalable storage. IT organizations embedded in a small or even a large medical center, hospital, or primary care network find it very hard to provide these at the same cost for the same service level. This has caused, and continues to cause, a transformation of organizational IT groups from providers of IT services into facilitators and guides who help their organizations couple IT services and resources more effectively to business needs, whether in patient care, cost recovery models, finance and administration, research, or building a learning health unit. This shift from purveyor of IT to collaborator and guide, helping the organization use EHRs, clinical systems, and billing systems effectively, manage genomic data, run the molecular tumor board, and carry out the myriad other data- and compute-critical activities, is profound and irreversible. For precision medicine, the capabilities available through the NCI Cloud Resources are transformative, highly accessible, data rich, and computationally robust.
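The melanoma and whole-slide classifiers mentioned above are built from one core operation: sliding a small learned filter over an image to produce a feature map. The toy sketch below shows that single building block in pure Python; real networks stack many learned filters with nonlinearities and pooling, and the image and kernel here are invented values.

```python
# A toy sketch of the core operation behind convolutional neural networks:
# sliding a small filter (kernel) over a 2-D image to produce a feature map.
# Real melanoma or whole-slide classifiers stack many learned filters with
# nonlinearities and pooling; this is only the single-filter building block.

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector applied to an image whose right half is bright:
# the feature map lights up exactly where the edge sits.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
print(convolve2d(image, kernel))  # -> [[0, 2, 0], [0, 2, 0]]
```

In a trained network the kernels are not hand-designed like this edge detector; they are learned from labeled examples, which is what lets the same machinery transfer from dermoscopy images to whole-slide pathology.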