Creating a Unified COVID-19 Global Resource in Record Time

Implementing a Real-Time Open Source Data Lake

To help the global research community fight the COVID-19 pandemic, C3 AI has developed and open sourced the C3 AI COVID-19 Data Lake™, a single, unified data image of critical COVID-19 data. Built on the C3 AI Suite™, the C3 AI COVID-19 Data Lake integrates data from disparate sources, then models and presents those data in a unified, cohesive structure. The C3 AI COVID-19 Data Lake is available at no cost to the global research community. The data is accessible via any utility that supports a RESTful interface, including commonly used tools such as Python, R, and Microsoft Power BI.
The C3 AI COVID-19 Data Lake integrates data from 40 different sources, and the Data Lake is designed to grow and scale over time. We anticipate making additional data sources available every few weeks. The current release of the C3 AI COVID-19 Data Lake has already generated significant interest.
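
Because the data is exposed through a RESTful interface, a client in any language mostly needs to build a URL and a JSON payload. The following is a minimal Python sketch of that pattern, not the documented C3 AI API: the base URL is a placeholder, and the type name ("outbreaklocation"), action name ("fetch"), and the {"spec": ...} request envelope are assumptions made for illustration. Substitute the values given in the official documentation.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the endpoint from the official docs.
BASE_URL = "https://example.org/api/1"


def build_request(type_name, action, spec, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for a REST endpoint.

    The URL layout and the {"spec": ...} envelope are assumptions for
    illustration, not a confirmed description of the service.
    """
    url = f"{base_url}/{type_name}/{action}"
    body = json.dumps({"spec": spec}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical type and action names; no network call is made here.
    req = build_request("outbreaklocation", "fetch", {"limit": 5})
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs; calling send(req) would perform the actual request once a real endpoint is configured.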

We anticipate this integrated data lake will help address a wide range of public and private sector problems, including:

  • Forecasting the spread of COVID-19 across municipalities and counties
  • Forecasting demand for ICU bed capacity to plan for the anticipated surge in COVID-19 cases
  • Supporting COVID-19 diagnosis using test results and medical imagery
  • Performing rapid literature scans across COVID-19 research using Natural Language Processing (NLP) techniques
  • Supporting genome sequence analysis for vaccine development
  • Identifying the impact of government responses and mandates on population mobility and infection rates
  • Comparing the spread of COVID-19 infections in different geographies


The COVID-19 virus spread across the globe in a matter of weeks and has infected people in nearly every country. Technology and data are critical enablers in the fight against this global pandemic. Up-to-date, accurate, representative, and comprehensive COVID-19 data can help us identify disease progression patterns, project and resolve healthcare capacity issues, better understand the virus genome, and develop medicines and vaccines to prevent future outbreaks. Developers and researchers face several challenges with publicly available COVID-19 data:

  • The lack of a central repository for COVID-19 data is a challenge for developers and researchers because the data often resides in disparate sources. Although there are ongoing efforts to integrate COVID-19 case data, such as The COVID Tracking Project at The Atlantic, a broader repository that brings together clinical (e.g., symptoms), biological (e.g., genomics), epidemiological, economic, and environmental data is missing.
  • Data consistency and standardization are major problems for developers and researchers because there is no commonly agreed framework for storing and distributing COVID-19 data. Developers must individually fuse and reconcile different data stores before analysis and must build custom data integrations to support analytic pipelines. These efforts are time consuming and inefficient.
  • Data quality is a persistent challenge with COVID-19 datasets. Developers and researchers spend significant time and effort manually cleansing, normalizing, and structuring COVID-19 data before analysis can begin.

Project Highlights

  • 3-week development timeline to release first version of the C3 AI COVID-19 Data Lake
  • 2-3 full-time-equivalent days to integrate each additional data source, including understanding the data structure, loading historical sets, and configuring settings for future data streams
  • Wide range of data types unified on the C3 AI Suite, including case time series, genome sequences, medical images, mobility, hospital capacity, economic data, and journal articles (text)
  • Open data repository that is continually enhanced with contributions and suggestions from the developer and research community through additional data sources and ML models

For the most up-to-date list of data sources included in the C3 AI COVID-19 Data Lake, please visit


  • 40 different data sources integrated and unified using a single data image
  • Data ingestion to keep the C3 AI COVID-19 Data Lake up to date
  • APIs to access the unified data using Python, R, Microsoft Power BI, and C3 AI Ex Machina
  • An easy start using popular analysis tools such as R and Python

