Enterprise AI Meets COVID-19
April 14, 2020
Enterprise AI is about applying data science and digital transformation to business and government processes. It is the fastest-growing segment of enterprise computing. I am confident that the largest commercial application of AI will be precision medicine: disease prediction, genome-specific medical protocols, and AI-assisted diagnosis will make more efficacious medical care more widely available at lower cost.
There is a clear opportunity to apply Enterprise AI to mitigate the global COVID-19 crisis. Promising avenues for impactful breakthroughs from AI include but are not limited to:
- Applying machine learning/AI methods to mitigate the spread of the COVID-19 pandemic
- Genome-specific COVID-19 medical protocols
- Biomedical informatics methods for drug design and repurposing
- Modeling, simulation, and prediction of COVID-19 propagation
- Efficacy of COVID-19 interventions
- Broader efforts in biomedicine, infectious disease modeling, response logistics and optimization, public health efforts, tools, and methodologies around the containment of rising infectious diseases, and response to pandemics so as to be better prepared for future infectious diseases.
Absent rich data sets, it is impossible to perform meaningful AI, so it is no surprise that many organizations are publishing COVID-19-related data sets in the public domain to fuel COVID-19 Enterprise AI efforts. Some notable recent efforts include:
- Johns Hopkins University: COVID-19 Data Repository
- The COVID Tracking Project
- The New York Times: COVID-19 Data in the United States
- nCoV-2019 Data Working Group: Epidemiology Data
- MOBS Lab: COVID-19 Situation Report
- World Health Organization: Daily Situation Reports
- European Centre for Disease Prevention and Control: Worldwide Situation Updates
- University of Montreal: COVID-19 Image Data Collection
- National Center for Biotechnology Information Virus Database
- COVID-19 Open Research Dataset (CORD-19)
- The AWS COVID-19 Data Lake
- Google BigQuery COVID-19 Dataset
- C3.ai COVID-19 Data Lake
Our analysis of the data sets being published suggests that these efforts fall into three categories:
Type 1: Lists of URLs
Type 2: Libraries of Discrete Data Sets
Type 3: Unified, Federated Data Images
All are positive contributions, but the three categories vary considerably in the potential benefit they offer.
Each is described below:
Type 1: Lists of URLs. The first type, like the CORD-19 project and MITRE Healthcare Coalition, provides lists of URLs that point to data sets stored in different locations and in different data structures and formats (e.g., text, images, numerical data, and voice).
Type 2: Libraries of Discrete Data Sets. The second type, like the AWS COVID-19 Data Lake program and the Google Open Cloud Platform, has taken many of the data sources accessible through the URLs referenced above and stored them in a digital library that is a collection of unique data storage systems (Postgres, DynamoDB, Neptune, Redshift, etc.). These are located in a common “physical” storage utility like AWS S3 or the Google Cloud. For data sets that share a common data structure, e.g., CSV or JSON, each data set is stored in its own database in that common S3 storage utility. In these systems, the data sets may be individually accessed through utilities available on AWS, like Postgres and SageMaker. Access is free up to a point; a fee structure kicks in once data volumes reach a limit or the researcher wants to integrate the data with other externally available data sets. The Amazon Data Lake is accessible through Amazon’s data access products. The Google data sets are available through Google Cloud utilities. The data are neither pre-integrated nor federated.
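To make that last point concrete, here is a minimal sketch of the manual join a researcher has to write when two discrete data sets (one CSV, one JSON) share only a location name. The field names and every value are invented placeholders for illustration. They are not real COVID-19 figures and do not reflect the schema of any data set listed above.

```python
import csv
import io
import json

# Placeholder inputs: invented values, NOT real COVID-19 data.
CASES_CSV = """location,date,confirmed
Region A,2020-04-01,100
Region B,2020-04-01,40
"""

POPULATION_JSON = (
    '[{"name": "Region A", "population": 50000},'
    ' {"name": "Region C", "population": 20000}]'
)


def join_cases_with_population(cases_csv, population_json):
    """Join case rows to population records on the location name.

    Returns (joined_rows, unmatched_locations). Rows whose location has
    no population record are reported instead of being silently dropped.
    """
    population = {
        rec["name"]: rec["population"] for rec in json.loads(population_json)
    }
    joined, unmatched = [], []
    for row in csv.DictReader(io.StringIO(cases_csv)):
        location = row["location"]
        if location not in population:
            unmatched.append(location)
            continue
        confirmed = int(row["confirmed"])
        joined.append({
            "location": location,
            "date": row["date"],
            "confirmed": confirmed,
            # Cases per 100,000 residents, derived from the joined record.
            "per_100k": confirmed * 100_000 / population[location],
        })
    return joined, unmatched


if __name__ == "__main__":
    rows, missing = join_cases_with_population(CASES_CSV, POPULATION_JSON)
    print(rows)
    print("unmatched:", missing)
```

Even in this toy case, one location fails to match because the two sources do not cover the same regions. Reconciling keys, formats, and coverage in this way is the work that falls to the researcher when data sets are stored side by side but not linked.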
Type 3: Unified, Federated Data Images. The C3.ai COVID-19 Data Lake is unique in that we have curated the data sets that we understand to be of the most utility to researchers, including those listed above, and aggregated those data into a unified, federated, logical image that is immediately available for researchers to access through any utility that offers RESTful data access (e.g., Excel, Tableau, R, or Python). We have preestablished the important linkages in those complex data sets so that researchers can easily navigate and explore the data features that may be of interest (e.g., diagnosis, age, locale, or preexisting condition) and can perform sophisticated data science on those data. The C3.ai COVID-19 Data Lake also provides researchers an abstraction layer over the disparate structured and unstructured data, so that the researcher does not have to be aware of the physical and logical structure and associations of those data. The Data Lake is immediately extensible by the end user and can be easily linked with other external data sets.
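As a rough illustration of what RESTful data access from Python could look like, the sketch below builds (but does not send) a JSON POST request and flattens a decoded JSON response into rows. The base URL, the `fetch` path, the `spec`/`filter`/`limit` request shape, and the `objs` response key are all assumptions made for this example. They are not taken from the C3.ai COVID-19 Data Lake documentation, which should be consulted for the actual API.

```python
import json
import urllib.request

# Hypothetical base URL used only for illustration; not a real endpoint.
BASE_URL = "https://example.com/covid/api/1"


def build_fetch_request(type_name, filter_expr, limit=100):
    """Build a JSON POST request for a hypothetical `fetch` action."""
    # Assumed request shape: a "spec" object carrying a filter and a limit.
    body = {"spec": {"filter": filter_expr, "limit": limit}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{type_name}/fetch",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def rows_from_response(payload, fields):
    """Flatten a decoded JSON response into dicts holding only `fields`.

    Assumes the records sit under an "objs" key; missing fields become None.
    """
    return [
        {field: obj.get(field) for field in fields}
        for obj in payload.get("objs", [])
    ]


if __name__ == "__main__":
    request = build_fetch_request("location", "id == 'Region_A'")
    print(request.get_method(), request.full_url)
    # Sending the request would look like this. It is left commented out
    # because the URL above is a placeholder:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     payload = json.load(response)
    payload = {"objs": [{"id": "Region_A", "name": "Region A", "extra": 1}]}
    print(rows_from_response(payload, ["id", "name"]))
```

Because the access path is plain HTTP and JSON, the same request could be issued from R, Excel, or Tableau, which is the practical meaning of "any utility that offers RESTful data access."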
As a result of the integration and federation of the data, we are able to provide rich knowledge graphs to assist the researcher in understanding the scope of and connections in the data:
Figure: C3.ai COVID-19 Knowledge Graph
The C3.ai COVID-19 Data Lake pre-establishes the important linkages in the disparate COVID-19 data sets sourced from all over the globe, so that researchers can easily navigate and explore the data features that may be of interest (e.g., diagnosis, age, locale, preexisting condition, etc.) and can perform sophisticated data science on those data.