Creating a Unified COVID-19 Global Resource in Record Time

Implementing a Real-Time Open Source Data Lake

To help the global research community in the fight against the COVID-19 pandemic, C3.ai has developed and open-sourced the C3.ai COVID-19 Data Lake™, a single, unified data image of critical COVID-19 data. Built on the C3 AI Suite™, the C3.ai COVID-19 Data Lake integrates data from disparate sources, then models and presents the data in a unified, cohesive structure. The C3.ai COVID-19 Data Lake is available at no cost to the global research community. The data is accessible through any utility that supports a RESTful interface, including commonly used tools such as Python, R, and Microsoft Power BI.
The first two releases of the C3.ai COVID-19 Data Lake integrate data from 22 different sources, and the Data Lake is designed to grow and scale over time. We anticipate making additional data sources available every few weeks. The current release of the C3.ai COVID-19 Data Lake has already generated significant interest.
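
Because access is through a RESTful interface, a client in any language only needs to send an HTTP request and parse the response. The following is a minimal Python sketch of that pattern, not the documented API: the base URL, the `locations` resource, the `fetch` action, and the `spec` fields are all hypothetical placeholders, since this page does not specify them. See c3.ai/covid for the actual endpoints.

```python
import json
import urllib.request

# Hypothetical placeholder: the real base URL is not given in this document.
BASE_URL = "https://example.org/covid/api/1"


def build_request(resource, action, spec, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for a RESTful data endpoint."""
    body = json.dumps({"spec": spec}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{resource}/{action}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative query only: resource, action, and spec fields are assumptions.
request = build_request("locations", "fetch", {"filter": "id == 'Italy'", "limit": 1})
print(request.get_method(), request.full_url)
```

Calling `fetch_json(request)` would send the query. It is left out of the sketch because the placeholder URL does not point at a real service.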

We anticipate this integrated data lake will help address a wide range of public and private sector problems, including:

  • Forecasting the spread of COVID-19 across municipalities and counties
  • Forecasting demand for ICU bed capacity to plan for the anticipated surge in COVID-19 cases
  • Supporting COVID-19 diagnosis using test results and medical imagery
  • Performing rapid literature scans across COVID-19 research using Natural Language Processing (NLP) techniques
  • Supporting genome sequence analysis for vaccine development
  • Identifying the impact of government responses and mandates on population mobility and infection rates
  • Comparing the spread of COVID-19 infections across different geographies

Challenge

The virus that causes COVID-19 spread across the globe in a matter of weeks and has infected people in nearly every country. Technology and data are critical enablers in the fight against this global pandemic. Up-to-date, accurate, representative, and comprehensive COVID-19 data can help us identify disease progression patterns, project and resolve healthcare capacity issues, better understand the virus genome, and develop medicines and vaccines to prevent future outbreaks. Developers and researchers face several challenges with publicly available COVID-19 data:

  • No central repository holds all COVID-19 data, which is challenging for developers and researchers because the data often resides in disparate sources. Although there are ongoing efforts to integrate COVID-19 case data, such as the COVID Tracking Project at The Atlantic, a broader repository that brings together clinical (e.g., symptoms), biological (e.g., genomics), epidemiological, economic, and environmental data is missing.
  • Data consistency and standardization are major problems for developers and researchers because there is no commonly agreed framework for storing and distributing COVID-19 data. Developers must individually fuse and reconcile different data stores before analysis and build custom data integrations to support analytic pipelines, efforts that are time consuming and inefficient.
  • Data quality is a persistent challenge with COVID-19 datasets. Developers and researchers spend significant time and effort manually cleansing, normalizing, and structuring COVID-19 data before analysis.
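
To make the standardization challenge concrete, here is a minimal Python sketch of the manual fusion work described above. It is not C3.ai's implementation. The two source layouts, their field names, the sample rows, and the rule of keeping the larger count when sources disagree are all assumptions chosen for illustration.

```python
from datetime import datetime


def normalize_source_a(row):
    """Source A (hypothetical): 'Country', US-style 'Date', 'Confirmed' keys."""
    return {
        "location": row["Country"].strip().lower(),
        "date": datetime.strptime(row["Date"], "%m/%d/%Y").date().isoformat(),
        "confirmed": int(row["Confirmed"]),
    }


def normalize_source_b(row):
    """Source B (hypothetical): 'region', ISO 'day', 'cases' with thousands separators."""
    return {
        "location": row["region"].strip().lower(),
        "date": row["day"],
        "confirmed": int(row["cases"].replace(",", "")),
    }


def unify(rows_a, rows_b):
    """Merge both sources into one schema, keyed by (location, date).

    On conflict the larger count wins; this is an arbitrary illustrative policy.
    """
    records = [normalize_source_a(r) for r in rows_a]
    records += [normalize_source_b(r) for r in rows_b]
    merged = {}
    for record in records:
        key = (record["location"], record["date"])
        if key not in merged or record["confirmed"] > merged[key]["confirmed"]:
            merged[key] = record
    return sorted(merged.values(), key=lambda r: (r["location"], r["date"]))


# Made-up sample rows, not real case counts.
rows_a = [{"Country": " Italy ", "Date": "03/01/2020", "Confirmed": "100"}]
rows_b = [
    {"region": "italy", "day": "2020-03-01", "cases": "120"},
    {"region": "Italy", "day": "2020-03-02", "cases": "1,250"},
]
unified = unify(rows_a, rows_b)
```

Each new source needs its own normalizer and its own conflict rules, which is the repeated effort that a unified data image is meant to remove.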

Project Highlights

  • 3-week development timeline to release the first version of the C3.ai COVID-19 Data Lake
  • 2-3 full-time-equivalent days to integrate each additional data source, including understanding the data structure, loading historical sets, and configuring settings for future data streams
  • Wide range of data types unified on the C3 AI Suite, including case time series, genome sequences, medical images, mobility, hospital capacity, economic data, and journal articles (text)
  • Open data repository continually enhanced with contributions and suggestions from the developer and research community through additional data sources and ML models

For the most up-to-date list of data sources included in the C3.ai COVID-19 Data Lake, please visit c3.ai/covid.

Results

  • 22 different data sources integrated and unified using a single data image as of May 15, 2020
  • Automated data ingestion to keep the C3.ai COVID-19 Data Lake up to date
  • RESTful APIs to access the unified data using Python, R, Microsoft Power BI, and C3.ai Ex Machina
  • Notebooks for easy start using popular analysis tools such as R and Python
