May 27, 2020

Tom Siebel’s Company Charts New Data Lake For COVID-19 Research

The enterprise data estate has grown. Enterprise business has for a long time had databases (everybody knows what these are), bigger data warehouses (that bring together data from disparate sources) and smaller data marts (components of data warehouses specific to an individual task, job or line of business)… all of which form the lion’s share of the data estate.

On top of all of these data streams, there’s data on user devices, data on intelligent machines inside the Internet of Things (IoT) and other types of data out at the ‘edge’ of networks, systems and companies, hence the term edge computing.

Then, there’s the data lake.

What is a data lake?

A data lake is a pool, but its contents may be murky in places. Some of the data in the data lake is unstructured (video, voice and email data is unstructured, as compared to spreadsheets and more ordered data sets) and it often exists in its ‘raw’ machine-level format. We often think about the data lake being quite a disordered and voluminous repository, but as with many things… order can be distilled from chaos.

Bringing the fractal nature of the data lake into workable shape involves data deduplication, filtering and federation. This is the mission of tech billionaire Tom Siebel’s company. The eponymously named Siebel Systems was an early pioneer in Customer Relationship Management (CRM) software before it was ultimately acquired by Oracle in 2006. Siebel then founded his current company in 2009 to provide what he has called ‘industrial scale Artificial Intelligence’ (AI) for tasks like the federation and unification of data in data lake resources.

With the huge glut of information that needs to be brought into order to service COVID-19 (Coronavirus) research, the company has now added 11 new integrated data sets to its trademarked COVID-19 Data Lake. Siebel and team claim that this makes it one of the largest pre-integrated and free sources of COVID-19 data in the world.

Researchers looking for COVID-19 data can use the lake to get access to data that is unified (i.e. structured and readable by machine) and normalized (i.e. deduplicated, free of anomalies and brought into a form so that the data itself behaves as it should given the ‘dependency’ relationships that a given database sets out and calls for) in order to aid the fight against COVID-19.
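To make ‘unified’ and ‘normalized’ a little more concrete, here is a minimal Python sketch of the general idea. It is not the company’s implementation: the two sources, their field names and the case counts are all invented for illustration, and the deduplication rule (one record per region and date) is an assumption.

```python
from datetime import datetime

# Hypothetical raw records from two made-up sources.
# Field names and values are illustrative only, not real COVID-19 data.
SOURCE_A = [
    {"region": "Ohio", "date": "2020-05-01", "cases": "120"},
    {"region": "Ohio", "date": "2020-05-01", "cases": "120"},  # exact duplicate
]
SOURCE_B = [
    {"State": "ohio ", "Day": "05/02/2020", "Confirmed": 135},
    {"State": "Texas", "Day": "05/02/2020", "Confirmed": 310},
]


def unify_a(rec):
    """Map a source A record onto the common schema (region, date, cases)."""
    return {
        "region": rec["region"].strip().title(),
        "date": datetime.strptime(rec["date"], "%Y-%m-%d").date().isoformat(),
        "cases": int(rec["cases"]),
    }


def unify_b(rec):
    """Map a source B record onto the same common schema."""
    return {
        "region": rec["State"].strip().title(),
        "date": datetime.strptime(rec["Day"], "%m/%d/%Y").date().isoformat(),
        "cases": int(rec["Confirmed"]),
    }


def deduplicate(records):
    """Keep the first record seen for each (region, date) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["region"], rec["date"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def build_unified(source_a, source_b):
    """Unify both sources into one schema, then remove duplicates."""
    unified = [unify_a(r) for r in source_a] + [unify_b(r) for r in source_b]
    return deduplicate(unified)


UNIFIED = build_unified(SOURCE_A, SOURCE_B)
```

Once every record shares the same field names, date format and types, a researcher can query across sources without writing source-specific cleanup code first.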

COVID-19 data is dispersed, divergent & difficult

The company reminds us that researchers are in a race to predict the virus’s continued trajectory as well as the extended effects of lockdown and track & trace initiatives. They are also looking to forecast demand for intensive care unit bed capacity, analyze the efficacy of COVID-19 guidelines, support COVID-19 diagnosis and speed the development of medical treatments. The challenge is that most data sets in this space are dispersed in a variety of different locations and in unusable formats. Without rich, integrated data sets, it is impossible to develop meaningful and accurate artificial intelligence models.

This data lake is said to be structurally different from other COVID-19 data collections in that it provides analysis-ready data that researchers can use immediately to enhance new or ongoing COVID-19 projects. Researchers at MIT, in collaboration with the Federal Emergency Management Agency (FEMA) and other agencies, are focused on the analysis of critical supply chain issues to understand the distribution and availability of COVID-19 testing equipment and personal protective equipment (PPE) – and the pandemic’s impact on freight flows throughout the country.

“Having access to an integrated set of diverse COVID-19 data sources with a common data model can help accelerate analysis of critical supply chain issues in our work with FEMA and other agencies,” said Tim Russell, research engineer at the MIT Humanitarian Supply Chain Lab, MIT Center for Transportation & Logistics. “The COVID-19 Data Lake provides a valuable resource in unifying and simplifying access to the necessary data without having to waste time on finding, cleaning, and preparing the data for analysis.”

Can I take your data order sir/madam?

CEO Thomas M. Siebel explains that the company is also encouraging researchers to recommend data sources they would like to see added to the COVID-19 Data Lake for future research. For example, a physician from a leading hospital has requested that all U.S. vaccination data be added to the data lake to study the impact of previous vaccinations on the rate of hospitalizations and infections.

Where can these researchers get into the lake then? The cloud, obviously.

The COVID-19 Data Lake is a single federated cloud image, updated in real time with pre-established linkages so researchers can navigate and explore all of the associations within and across the data sources through a knowledge graph. Researchers can then apply advanced data science methods against the corpus of all COVID-19 data.

We’re getting there, hopefully, one step at a time and data (in the shape of data management) is of course playing an important role. We all know that we may find a lot of things work differently in the aftermath of COVID-19 and part of that new normal could be the more widespread use of data lake technologies for all businesses. Indeed, Siebel’s company focuses on manufacturing, oil & gas, banking, utilities, retail, transportation, telecommunications, aerospace & defense as well as smart cities when it’s not working with healthcare data.

Data will help us, but please still wash your hands.
