{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# C3.ai COVID-19 Data Lake Quickstart in Python  \n",
    "\n",
    "Version 3.0 (June 2, 2020).\n",
    "\n",
    "This Jupyter notebook shows some examples of how to access and use each of the [C3.ai COVID-19 Data Lake](https://c3.ai/covid/) APIs. These examples show only a small piece of what you can do with the C3.ai COVID-19 Data Lake, but will get you started with performing your own exploration. See the [API documentation](https://c3.ai/covid-19-api-documentation/) for more details.\n",
    "\n",
    "Please contribute your questions, answers and insights on [Stack Overflow](https://www.stackoverflow.com). Tag `c3ai-datalake` so that others can view and help build on your contributions. For support, please send email to: [covid@c3.ai](mailto:covid@c3.ai)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "- [Helper methods for accessing the API](#helpers)\n",
    "- [Access OutbreakLocation data](#outbreaklocation)\n",
    "    - [Case counts](#outbreaklocation/casecounts)\n",
    "    - [Demographics](#outbreaklocation/demographics)\n",
    "    - [Mobility](#outbreaklocation/mobility)\n",
    "    - [Projections](#outbreaklocation/projections)\n",
    "- [Access LineListRecord data](#linelistrecord)\n",
    "- [Join BiologicalAsset and Sequence data](#biologicalasset)\n",
    "- [Access BiblioEntry data](#biblioentry)\n",
    "- [Join TherapeuticAsset and ExternalLink data](#therapeuticasset)\n",
    "- [Join Diagnosis and DiagnosisDetail data](#diagnosis)\n",
    "- [Access VaccineCoverage data](#vaccinecoverage)\n",
    "- [Access Policy data](#policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the [requests](https://requests.readthedocs.io/en/master/), [pandas>=1.0.0](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/3.2.1/index.html), and [scipy](https://www.scipy.org/) libraries before using this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "from scipy.stats import gamma\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensure that you have a recent version of pandas (>= 1.0.0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"pandas version\", pd.__version__)\n",
    "assert pd.__version__[0] >= \"1\", \"To use this notebook, upgrade to the newest version of pandas. See https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html for details.\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"helpers\"></a>\n",
    "## Helper methods for accessing the API\n",
    "\n",
    "The helper methods below convert a JSON response from the C3.ai APIs to a Pandas DataFrame. Please run this cell before using the quickstart examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_data_json(typename, api, body):\n",
    "    \"\"\"\n",
    "    read_data_json directly accesses the C3.ai COVID-19 Data Lake APIs using the requests library, \n",
    "    and returns the response as a JSON, raising an error if the call fails for any reason.\n",
    "    ------\n",
    "    typename: The type you want to access, i.e. 'OutbreakLocation', 'LineListRecord', 'BiblioEntry', etc.\n",
    "    api: The API you want to access, either 'fetch' or 'evalmetrics'.\n",
    "    body: The spec you want to pass. For examples, see the API documentation.\n",
    "    \"\"\"\n",
    "    response = requests.post(\n",
    "        \"https://api.c3.ai/covid/api/1/\" + typename + \"/\" + api, \n",
    "        json = body, \n",
    "        headers = {\n",
    "            'Accept' : 'application/json', \n",
    "            'Content-Type' : 'application/json'\n",
    "        }\n",
    "    )\n",
    "    response.raise_for_status()\n",
    "    \n",
    "    return response.json()\n",
    "\n",
    "def fetch(typename, body, get_all = False, remove_meta = True):\n",
    "    \"\"\"\n",
    "    fetch accesses the C3.ai COVID-19 Data Lake using read_data_json, and converts the response into a Pandas dataframe. \n",
    "    fetch is used for all non-timeseries data in the C3.ai COVID-19 Data Lake, and will call read_data as many times \n",
    "    as required to access all of the relevant data for a given typename and body.\n",
    "    ------\n",
    "    typename: The type you want to access, i.e. 'OutbreakLocation', 'LineListRecord', 'BiblioEntry', etc.\n",
    "    body: The spec you want to pass. For examples, see the API documentation.\n",
    "    get_all: If True, get all records and ignore any limit argument passed in the body. If False, use the limit argument passed in the body. The default is False.\n",
    "    remove_meta: If True, remove metadata about each record. If False, include it. The default is True.\n",
    "    \"\"\"\n",
    "    if get_all:\n",
    "        has_more = True\n",
    "        offset = 0\n",
    "        limit = 2000\n",
    "        df = pd.DataFrame()\n",
    "\n",
    "        while has_more:\n",
    "            body['spec'].update(limit = limit, offset = offset)\n",
    "            response_json = read_data_json(typename, 'fetch', body)\n",
    "            new_df = pd.json_normalize(response_json['objs'])\n",
    "            df = df.append(new_df)\n",
    "            has_more = response_json['hasMore']\n",
    "            offset += limit\n",
    "            \n",
    "    else:\n",
    "        response_json = read_data_json(typename, 'fetch', body)\n",
    "        df = pd.json_normalize(response_json['objs'])\n",
    "        \n",
    "    if remove_meta:\n",
    "        df = df.drop(columns = [c for c in df.columns if ('meta' in c) | ('version' in c)])\n",
    "    \n",
    "    return df\n",
    "    \n",
    "def evalmetrics(typename, body, get_all = False, remove_meta = True):\n",
    "    \"\"\"\n",
    "    evalmetrics accesses the C3.ai COVID-19 Data Lake using read_data_json, and converts the response into a Pandas dataframe.\n",
    "    evalmetrics is used for all timeseries data in the C3.ai COVID-19 Data Lake.\n",
    "    ------\n",
    "    typename: The type you want to access, i.e. 'OutbreakLocation', 'LineListRecord', 'BiblioEntry', etc.\n",
    "    body: The spec you want to pass. For examples, see the API documentation.\n",
    "    get_all: If True, get all metrics and ignore limits on number of expressions and ids. If False, consider expressions and ids limits. The default is False.\n",
    "    remove_meta: If True, remove metadata about each record. If False, include it. The default is True.\n",
    "    \"\"\"\n",
    "    if get_all:\n",
    "        expressions = body['spec']['expressions']\n",
    "        ids = body['spec']['ids']\n",
    "        df = pd.DataFrame()\n",
    "        \n",
    "        for ids_start in range(0, len(ids), 10):\n",
    "            for expressions_start in range(0, len(expressions), 4):\n",
    "                body['spec'].update(\n",
    "                    ids = ids[ids_start : ids_start + 10],\n",
    "                    expressions = expressions[expressions_start : expressions_start + 4]\n",
    "                )\n",
    "                response_json = read_data_json(typename, 'evalmetrics', body)\n",
    "                new_df = pd.json_normalize(response_json['result'])\n",
    "                new_df = new_df.apply(pd.Series.explode)\n",
    "                df = pd.concat([df, new_df], axis = 1)\n",
    "            \n",
    "    else:\n",
    "        response_json = read_data_json(typename, 'evalmetrics', body)\n",
    "        df = pd.json_normalize(response_json['result'])\n",
    "        df = df.apply(pd.Series.explode)\n",
    "\n",
    "    # get the useful data out\n",
    "    if remove_meta:\n",
    "        df = df.filter(regex = 'dates|data|missing')\n",
    "    \n",
    "    # only keep one date column\n",
    "    date_cols = [col for col in df.columns if 'dates' in col]\n",
    "    keep_cols =  date_cols[:1] + [col for col in df.columns if 'dates' not in col]\n",
    "    df = df.filter(items = keep_cols).rename(columns = {date_cols[0] : \"dates\"})\n",
    "    df[\"dates\"] = pd.to_datetime(df[\"dates\"])\n",
    "    \n",
    "    return df\n",
    "\n",
    "def getprojectionhistory(body, remove_meta = True):\n",
    "    \"\"\"\n",
    "    getprojectionhistory accesses the C3.ai COVID-19 Data Lake using read_data_json, and converts the response into a Pandas dataframe.\n",
    "    ------\n",
    "    body: The spec you want to pass. For examples, see the API documentation.\n",
    "    remove_meta: If True, remove metadata about each record. If False, include it. The default is True.\n",
    "    \"\"\"  \n",
    "    response_json = read_data_json(\"outbreaklocation\", 'getprojectionhistory', body)\n",
    "    df = pd.json_normalize(response_json)\n",
    "    df = df.apply(pd.Series.explode)\n",
    "\n",
    "    # get the useful data out\n",
    "    if remove_meta:\n",
    "        df = df.filter(regex = 'dates|data|missing|expr')\n",
    "    \n",
    "    # only keep one date column\n",
    "    date_cols = [col for col in df.columns if 'dates' in col]\n",
    "    keep_cols =  date_cols[:1] + [col for col in df.columns if 'dates' not in col]\n",
    "    df = df.filter(items = keep_cols).rename(columns = {date_cols[0] : \"dates\"})\n",
    "    df[\"dates\"] = pd.to_datetime(df[\"dates\"])\n",
    "    \n",
    "    # rename columns to simplify naming convention\n",
    "    df = df.rename(columns = lambda x: x.replace(\".value\", \"\"))\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation\"></a>\n",
    "## Access OutbreakLocation data\n",
    "\n",
    "OutbreakLocation stores location data such as countries, provinces, cities, where COVID-19 outbeaks are recorded. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation) for more details and for a list of available locations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch facts about Germany\n",
    "locations = fetch(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"id == 'Germany'\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "locations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/casecounts\"></a>\n",
    "### Case counts\n",
    "\n",
    "A variety of sources provide counts of cases, deaths, recoveries, and other statistics for counties, provinces, and countries worldwide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Total number of confirmed cases, deaths, and recoveries in Santa Clara, California\n",
    "today = pd.Timestamp.now().strftime(\"%Y-%m-%d\")\n",
    "\n",
    "casecounts = evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"SantaClara_California_UnitedStates\"],\n",
    "            \"expressions\" : [\"JHU_ConfirmedCases\", \"JHU_ConfirmedDeaths\", \"JHU_ConfirmedRecoveries\"],\n",
    "            \"start\" : \"2020-01-01\",\n",
    "            \"end\" : today,\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "casecounts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot these counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedCases.data\"],\n",
    "    label = \"JHU_ConfirmedCases\"\n",
    ")\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedDeaths.data\"],\n",
    "    label = \"JHU_ConfirmedDeaths\"\n",
    ")\n",
    "plt.plot(\n",
    "    casecounts[\"dates\"],\n",
    "    casecounts[\"SantaClara_California_UnitedStates.JHU_ConfirmedRecoveries.data\"],\n",
    "    label = \"JHU_ConfirmedCases\"\n",
    ")\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Count\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export case counts as a .csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment the line below to export the DataFrame as a .csv file\n",
    "# casecounts.to_csv(\"casecounts.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/demographics\"></a>\n",
    "### Demographics\n",
    "\n",
    "Demographic and economic data from the US Census Bureau and The World Bank allow demographic comparisons across locations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population = fetch(\n",
    "    \"populationdata\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"!contains(parent, '_') && (populationAge == '>=65' || populationAge == 'Total') && gender == 'Male/Female' && year == '2018' && estimate == 'False' && percent == 'False'\"\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "population"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population_age_distribution = population.loc[\n",
    "    :, \n",
    "    [\"populationAge\", \"parent.id\", \"value\"]\n",
    "].pivot(index = \"parent.id\", columns = \"populationAge\")['value']\n",
    "population_age_distribution[\"proportion_over_65\"] = population_age_distribution[\">=65\"] / population_age_distribution[\"Total\"]\n",
    "\n",
    "population_age_distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Access global death counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_deaths = evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : list(population_age_distribution.index),\n",
    "            \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
    "            \"start\" : \"2020-05-01\",\n",
    "            \"end\" : \"2020-05-01\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "global_deaths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_deaths_by_country = global_deaths.filter(regex=(\"\\.data\")).melt()\n",
    "global_deaths_by_country[\"country\"] = global_deaths_by_country[\"variable\"].str.replace(\"\\..*\", \"\")\n",
    "\n",
    "global_comparison = global_deaths_by_country.set_index(\"country\").join(population_age_distribution)\n",
    "global_comparison[\"deaths_per_million\"] = 1e6 * global_comparison[\"value\"] / global_comparison[\"Total\"] \n",
    "global_comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.scatter(\n",
    "    global_comparison[\"proportion_over_65\"],\n",
    "    global_comparison[\"deaths_per_million\"]\n",
    ")\n",
    "plt.xlabel(\"Proportion of population over 65\")\n",
    "plt.ylabel(\"Confirmed COVID-19 deaths\\nper million people\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/mobility\"></a>\n",
    "### Mobility\n",
    "\n",
    "Mobility data from Apple and Google provide a view of the impact of COVID-19 and social distancing on mobility trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mobility_trends = evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"DistrictofColumbia_UnitedStates\"],\n",
    "            \"expressions\" : [\n",
    "                \"Apple_WalkingMobility\", \n",
    "                \"Apple_DrivingMobility\",\n",
    "                \"Google_ParksMobility\",\n",
    "                \"Google_ResidentialMobility\"\n",
    "              ],\n",
    "            \"start\" : \"2020-03-01\",\n",
    "            \"end\" : \"2020-04-01\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "mobility_trends"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot these mobility trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    [100 for d in mobility_trends[\"dates\"]],\n",
    "    label = \"Baseline\",\n",
    "    linestyle = \"dashed\",\n",
    "    color = \"black\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_WalkingMobility.data\"],\n",
    "    label = \"Apple_WalkingMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Apple_DrivingMobility.data\"],\n",
    "    label = \"Apple_DrivingMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ParksMobility.data\"],\n",
    "    label = \"Google_ParksMobility\"\n",
    ")\n",
    "plt.plot(\n",
    "    mobility_trends[\"dates\"],\n",
    "    mobility_trends[\"DistrictofColumbia_UnitedStates.Google_ResidentialMobility.data\"],\n",
    "    label = \"Google_ResidentialMobility\"\n",
    ")\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Mobility compared to baseline (%)\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"outbreaklocation/projections\"></a>\n",
    "### Projections\n",
    "\n",
    "Use the GetProjectionHistory API to retrieve versioned time series projections for specific metrics made at specific points in time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve projections made between April 13 and May 1 of mean total cumulative deaths in Spain from April 13 to May 13\n",
    "projections = getprojectionhistory(\n",
    "    {\n",
    "        \"outbreakLocation\": \"Spain\", \n",
    "        \"metric\": \"UniversityOfWashington_TotdeaMean_Hist\",\n",
    "        \"metricStart\": \"2020-04-13\", \n",
    "        \"metricEnd\": \"2020-05-13\",\n",
    "        \"observationPeriodStart\": \"2020-04-13\",\n",
    "        \"observationPeriodEnd\": \"2020-05-01\"\n",
    "    }\n",
    ")\n",
    "\n",
    "projections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve actual total cumulative deaths in Spain from April 1 to May 13\n",
    "deaths = evalmetrics(\n",
    "    \"outbreaklocation\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"ids\" : [\"Spain\"],\n",
    "            \"expressions\" : [\"JHU_ConfirmedDeaths\"],\n",
    "            \"start\" : \"2020-04-01\",\n",
    "            \"end\" : \"2020-05-13\",\n",
    "            \"interval\" : \"DAY\",\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "deaths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (8, 6))\n",
    "plt.plot(\n",
    "    deaths[\"dates\"],\n",
    "    deaths[\"Spain.JHU_ConfirmedDeaths.data\"],\n",
    "    label = \"JHU_ConfirmedDeaths\",\n",
    "    color = \"black\"\n",
    ")\n",
    "for col in projections.columns:\n",
    "    if 'data' in col:\n",
    "        expr = projections[col.replace(\"data\", \"expr\")].iloc[0]\n",
    "        projection_date = pd.to_datetime(expr.split(\" \")[-1])\n",
    "        plt.plot(\n",
    "            projections.loc[projections[\"dates\"] >= projection_date, \"dates\"],\n",
    "            projections.loc[projections[\"dates\"] >= projection_date, col],\n",
    "            label = expr\n",
    "        )\n",
    "\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.ylabel(\"Count\")\n",
    "plt.title(\"Cumulative death count projections versus actual count\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"linelistrecord\"></a>\n",
    "## Access LineListRecord data\n",
    "\n",
    "LineListRecord stores individual-level crowdsourced information from laboratory-confirmed COVID-19 patients. Information includes gender, age, symptoms, travel history, location, reported onset, confirmation dates, and discharge status. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/LineListRecord) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the line list records tracked by MOBS Lab\n",
    "records = fetch(\n",
    "    \"linelistrecord\",\n",
    "    {\n",
    "        \"spec\" : {\n",
    "            \"filter\" : \"lineListSource == 'DXY'\"\n",
    "        }\n",
    "    },\n",
    "    get_all = True\n",
    ")\n",
    "\n",
    "records"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the most common symptoms in this dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all the symptoms, which are initially comma-separated\n",
    "symptom_df = records.copy()\n",
    "symptom_df[\"symptoms\"] = symptom_df[\"symptoms\"].str.split(\", \")\n",
    "symptom_df = symptom_df.explode(\"symptoms\")\n",
    "symptom_df = symptom_df.dropna(subset = [\"symptoms\"])\n",
    "symptom_freq = symptom_df.groupby([\"symptoms\"]).agg(\"count\")[[\"id\"]].sort_values(\"id\")\n",
    "\n",
    "# Plot the data\n",
    "plt.figure(figsize = (10, 6))\n",
    "plt.bar(symptom_freq.index, symptom_freq[\"id\"])\n",
    "plt.xticks(rotation = 90)\n",
    "plt.xlabel(\"Symptom\")\n",
    "plt.ylabel(\"Number of patients\")\n",
    "plt.title(\"Common COVID-19 symptoms\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a patient is symptomatic and later hospitalized, how long does it take for them to become hospitalized after developing symptoms?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the number of days from development of symptoms to hospitalization for each patient\n",
    "hospitalized = records.dropna(subset = [\"hospitalAdmissionDate\", \"symptomStartDate\"])\n",
    "hospitalization_time = np.array(\n",
    "    pd.to_datetime(hospitalized['hospitalAdmissionDate']) - pd.to_datetime(hospitalized['symptomStartDate'])\n",
    ").astype('timedelta64[D]').astype('float')\n",
    "hospitalization_time = hospitalization_time[hospitalization_time >= 0]\n",
    "\n",
    "# Hospitalization time of 0 days is replaced with 0.1 to indicate near-immediate hospitalization\n",
    "hospitalization_time[hospitalization_time <= 0.1] = 0.1\n",
    "\n",
    "# Fit a gamma distribution\n",
    "a, loc, scale = gamma.fit(hospitalization_time, floc = 0)\n",
    "dist = gamma(a, loc, scale)\n",
    "\n",
    "# Plot the results\n",
    "x = np.linspace(0, np.max(hospitalization_time), 1000)\n",
    "n_bins = int(np.max(hospitalization_time) + 1)\n",
    "print(n_bins)\n",
    "\n",
    "plt.figure(figsize = (10, 6))\n",
    "plt.hist(\n",
    "    hospitalization_time, \n",
    "    bins = n_bins, \n",
    "    range = (0, np.max(hospitalization_time)), \n",
    "    density = True, \n",
    "    label = \"Observed\"\n",
    ")\n",
    "plt.plot(x, dist.pdf(x), 'r-', lw=5, alpha=0.6, label = 'Gamma distribution')\n",
    "plt.ylim(0, 0.5)\n",
    "plt.xlabel(\"Days from development of symptoms to hospitalization\")\n",
    "plt.ylabel(\"Proportion of patients\")\n",
    "plt.title(\"Distribution of time to hospitalization\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"biologicalasset\"></a>\n",
    "## Join BiologicalAsset and Sequence data\n",
    "\n",
    "BiologicalAsset stores the metadata of the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. Sequence stores the genome sequences collected from SARS-CoV-2 samples in the National Center for Biotechnology Information Virus Database. See the API documentation for [BiologicalAsset](https://c3.ai/covid-19-api-documentation/#tag/BiologicalAsset) and [Sequence](https://c3.ai/covid-19-api-documentation/#tag/Sequence) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join data from BiologicalAsset & Sequence\n",
    "sequences = fetch(\n",
    "  \"biologicalasset\",\n",
    "  {\n",
    "    \"spec\" : {\n",
    "      \"include\" : \"this, sequence.sequence\",\n",
    "      \"filter\" : \"exists(sequence.sequence)\"\n",
    "    }\n",
    "  },\n",
    "  get_all = True\n",
    ")\n",
    "\n",
    "sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"biblioentry\"></a>\n",
    "## Access BiblioEntry data\n",
    "\n",
    "BiblioEntry stores the metadata about the journal articles in the CORD-19 Dataset. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/BiblioEntry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch metadata for the first two thousand (2000) BiblioEntry journal articles approved for commercial use\n",
    "# Note that 2000 records are returned; the full dataset can be accessed using the get_all = True argument in fetch\n",
    "bibs = fetch(\n",
    "  \"biblioentry\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"hasFullText == true\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "# Sort them to get the most recent articles first\n",
    "bibs[\"publishTime\"] = pd.to_datetime(bibs[\"publishTime\"])\n",
    "bibs = bibs.sort_values(\"publishTime\", ascending = False)\n",
    "\n",
    "bibs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use GetArticleMetadata to access the full-text of these articles, or in this case, the first page text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bib_id = bibs.loc[0, \"id\"] \n",
    "print(bib_id)\n",
    "\n",
    "article_data = read_data_json(\n",
    "    \"biblioentry\",\n",
    "    \"getarticlemetadata\",\n",
    "    {\n",
    "        \"ids\" : [bib_id]\n",
    "    }\n",
    ")\n",
    "\n",
    "article_data[\"value\"][\"value\"][0][\"body_text\"][0][\"text\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"therapeuticasset\"></a>\n",
    "## Join TherapeuticAsset and ExternalLink data\n",
    "\n",
    "TherapeuticAsset stores details about the research and development (R&D) of coronavirus therapies, for example, vaccines, diagnostics, and antibodies. ExternalLink stores website URLs cited in the data sources containing the therapies stored in the TherapeuticAssets C3.ai Type. See the API documentation for [TherapeuticAsset](https://c3.ai/covid-19-api-documentation/#tag/TherapeuticAsset) and [ExternalLink](https://c3.ai/covid-19-api-documentation/#tag/ExternalLink) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join data from TherapeuticAsset and ExternalLink (productType, description, origin, and URL links)\n",
    "assets = fetch(\n",
    "  \"therapeuticasset\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"include\" : \"productType, description, origin, links.url\",\n",
    "          \"filter\" : \"origin == 'Milken'\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "assets = assets.explode(\"links\")\n",
    "assets[\"links\"] = [link[\"url\"] if type(link) == dict and \"url\" in link.keys() else None for link in assets[\"links\"]]\n",
    "assets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"diagnosis\"></a>\n",
    "## Join Diagnosis and DiagnosisDetail data\n",
    "\n",
    "Diagnosis stores basic clinical data (e.g. clinical notes, demographics, test results, x-ray or CT scan images) about individual patients tested for COVID-19, from research papers and healthcare institutions. DiagnosisDetail stores detailed clinical data (e.g. lab tests, pre-existing conditions, symptoms) about individual patients in key-value format. See the API documentation for [Diagnosis](https://c3.ai/covid-19-api-documentation/#tag/Diagnosis) and [DiagnosisDetail](https://c3.ai/covid-19-api-documentation/#tag/DiagnosisDetail) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses = fetch(\n",
    "  \"diagnosis\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"contains(testResults, 'COVID-19')\", \n",
    "          \"include\" : \"this, diagnostics.source, diagnostics.key, diagnostics.value\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "diagnoses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses_long = diagnoses.explode(\"diagnostics\")\n",
    "diagnoses_long = pd.concat([\n",
    "    diagnoses_long.reset_index(),\n",
    "    pd.json_normalize(diagnoses_long[\"diagnostics\"])[[\"key\", \"value\"]]\n",
    "], axis = 1).set_index(\"index\").drop(columns = \"diagnostics\")\n",
    "diagnoses_long"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnoses_wide = diagnoses_long.pivot(columns = \"key\", values = \"value\")\n",
    "diagnoses_wide = pd.concat([diagnoses, diagnoses_wide], axis = 1).drop(columns = \"diagnostics\")\n",
    "diagnoses_wide"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the GetImageURLs API to view the image associated with a diagnosis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diagnosis_id = diagnoses_wide.loc[0, \"id\"] \n",
    "print(diagnosis_id)\n",
    "\n",
    "image_urls = read_data_json(\n",
    "    \"diagnosis\",\n",
    "    \"getimageurls\",\n",
    "    {\n",
    "        \"ids\" : [diagnosis_id]\n",
    "    }\n",
    ")\n",
    "\n",
    "print(image_urls[\"value\"][diagnosis_id][\"value\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"vaccinecoverage\"></a>\n",
    "## Access VaccineCoverage data\n",
    "\n",
    "VaccineCoverage stores historical vaccination rates for various demographic groups in US counties and states, based on data from the US Centers for Disease Control (CDC). See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/VaccineCoverage) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vaccine_coverage = fetch(\n",
    "  \"vaccinecoverage\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"filter\" : \"vaxView == 'Influenza' && contains(vaccineDetails, 'General Population') && (location == 'California_UnitedStates' || location == 'Texas_UnitedStates') && contains(demographicClass, 'Race/ethnicity') && year == 2018\"\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "vaccine_coverage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How does vaccine coverage vary by race/ethnicity in these locations?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vaccine_coverage[\"upperError\"] = vaccine_coverage[\"upperLimit\"] - vaccine_coverage[\"value\"]\n",
    "vaccine_coverage[\"lowerError\"] = vaccine_coverage[\"value\"] - vaccine_coverage[\"lowerLimit\"]\n",
    "\n",
    "plt.figure(figsize = (10, 6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.bar(\n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"demographicClassDetails\"], \n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"value\"], \n",
    "    yerr = [\n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"upperError\"], \n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"California_UnitedStates\", \"lowerError\"], \n",
    "    ]\n",
    ")\n",
    "plt.ylabel(\"Vaccination rate (%)\")\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.title(\"California, United States\")\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "plt.bar(\n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"demographicClassDetails\"], \n",
    "    vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"value\"], \n",
    "    yerr = [\n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"upperError\"], \n",
    "        vaccine_coverage.loc[vaccine_coverage[\"location.id\"] == \"Texas_UnitedStates\", \"lowerError\"], \n",
    "    ]\n",
    ")\n",
    "plt.ylabel(\"Vaccination rate (%)\")\n",
    "plt.xticks(rotation = 45, ha = \"right\")\n",
    "plt.title(\"Texas, United States\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"policy\"></a>\n",
    "## Access Policy data\n",
    "\n",
    "Policy stores COVID-19 social distancing and health policies and regulations enacted by US states. See the [API documentation](https://c3.ai/covid-19-api-documentation/#tag/Policy) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "policies = fetch(\n",
    "  \"policy\",\n",
    "  {\n",
    "      \"spec\" : {\n",
    "          \"limit\" : -1\n",
    "      }\n",
    "  }\n",
    ")\n",
    "\n",
    "policies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the AllVersionsForPolicy API to access historical and current versions of a policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "versions = read_data_json(\n",
    "    \"policy\",\n",
    "    \"allversionsforpolicy\",\n",
    "    {\n",
    "        \"this\" : {\n",
    "            \"id\" : \"Wisconsin_UnitedStates_Policy\"\n",
    "        }\n",
    "    }\n",
    ")\n",
    "\n",
    "pd.json_normalize(versions)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}