Building CovidCureId.com
Since the very beginning of the COVID pandemic, the German news outlet Deutsche Welle has provided some of the best global coverage of the topic. DW's COVID coverage on YouTube is particularly good: short, digestible segments of relevant reporting with very little drama and fanfare.
Watching the recent coverage of the situation unfolding in India and reading first-hand accounts from the folks in r/india, I felt inspired to revisit the CURE ID database with the thought that the COVID case data it contains may be of value where access to therapeutics is limited.
I decided to build covidcureid.com to surface just the COVID case data from CURE ID, with the (perhaps misguided) hope of helping in some small way by making this case data more accessible. The site focuses on finding cases by three simple facets:
- Age
- Gender
- Drug and its associated efficacy data
A cleaner, more responsive, mobile-friendly user interface makes it easier to locate cases of interest and understand the different real-world outcomes.
Tapping through provides access to the individual cases filtered by the patient's age and gender. Case notes are included, along with a direct link to the case in CURE ID, which displays additional context and information about the case, including community discussions, dosing information, and other details.
The open-source project starts by extracting the data from the CURE ID database through its REST APIs. From there, the dataset is loaded into a Microsoft Azure CosmosDB instance, which is an ideal endpoint for storing the extracted JSON-formatted case data.
The front-end UI is developed using Vue.js and the excellent Quasar framework with mobile devices in mind. All of it is hosted in an Azure Static Web App to improve performance while reducing cost.
The application backend is built on Azure Functions — again, improving performance and scalability while reducing cost.
There is much more detail available at the GitHub project page for anyone interested. I also have a write-up on LinkedIn that talks a bit more about the CURE ID application itself.
In this post, we’ll walk through the methodology and implementation of the app.
Understanding the Data
Unlike the ClinicalTrials.gov application, which provides a well-documented REST API for interacting with the data, CURE ID does not provide a documented API.
However, we can see the requests in action by simply popping open the browser's developer tools and inspecting the network traffic.
Navigating to one of those request URLs reveals that CURE ID serves the API from a Django application, likely running in AWS.
With this, we can start to extract the data.
Extracting the CURE ID COVID Data
Looking at the URLs, there are two queries we need in order to extract the case information from CURE ID:
```
# First we need to extract the list of ALL treatments
# 630 is the ID of COVID
https://cure-api2.ncats.io/v1/drugs?disease=630&no_page

# Then we need to iterate and execute a query for each treatment
https://cure-api2.ncats.io/v1/reports?disease=630&drugs_id=<TREATMENT_ID>&outcome_computed=&no_page=
```
A simple Node.js script pulls this information down and stores it in .json files for exploration:
```js
(async () => {
  const axios = require('axios')
  const fs = require('fs')

  console.log('Fetching data from CURE ID...')

  // Start by grabbing the root level listing from the CURE ID database
  const treatmentsResponse = await axios({
    url: 'https://cure-api2.ncats.io/v1/drugs?disease=630&no_page',
    method: 'GET'
  })

  const treatmentsResult = treatmentsResponse.data

  // This gives us the list of all treatments.
  fs.writeFile(
    'raw-files/01-covid-treatment-listing.json',
    JSON.stringify(treatmentsResult),
    function (err) { console.log(err) }
  )

  // And then for each treatment, we retrieve the list of case reports.
  for (const treatment of treatmentsResult) {
    // Only fetch specific treatment IDs; remove this check to pull everything.
    if (!(treatment.id === 11388 || treatment.id === 11364)) {
      continue
    }

    console.log(`  Fetching case data for: ${treatment.name} (${treatment.id})`)

    const treatmentResponse = await axios({
      url: `https://cure-api2.ncats.io/v1/reports?disease=630&drugs_id=${treatment.id}&outcome_computed=&no_page=`,
      method: 'GET'
    })

    const caseResults = treatmentResponse.data

    fs.writeFile(
      `raw-files/02-${treatment.id}-${treatment.name.replace('/', '--')}.json`,
      JSON.stringify(caseResults),
      function (err) { console.log(err) }
    )

    console.log(`  Saved ${caseResults.length} records`)
  }
})()
```
It is possible to skip the step of saving the files and feed the data directly into an ETL pipeline instead, but without a schema or documentation, it helps to have the files on hand to explore the data.
Importing the Data
The next question is how to get the data into CosmosDB. Since CosmosDB is a document-oriented database and we already have the data in JSON files at this point, we could import it as-is. However, each case record contains a lot of information that isn't interesting, and I also want to limit the throughput so that the overall request units (RUs) consumed in CosmosDB stay below the free tier limit of 400 RU/s.
(It is also possible to use Azure's dedicated ETL capabilities, but we want to do this for free if we can.)
To achieve this, we can use a four-step hop:
- Put the files into Azure Storage Blobs
- Use an Azure Storage Blob trigger to invoke an Azure Function that transforms the data
- Push the transformed data into an Azure Storage Queue to buffer the throughput to Azure CosmosDB
- Write the data forward to Azure CosmosDB from a queue-triggered Function
Loading the data can be done manually using Azure Storage Explorer or from the command line with the Azure CLI:
```
az storage blob upload-batch ^
    --destination covidcureid-raw-files ^
    --source "${rawDataFileDir}" ^
    --pattern "02-*.json" ^
    --connection-string "UseDevelopmentStorage=true"
```
This will then trigger a Function that is listening for new files:
[FunctionName("ProcessDataFile")] public void ProcessDataFile( [BlobTrigger("covidcureid-raw-files/{name}", Connection = "AzureWebJobsStorage")]Stream blob, string name, [Queue("covidcureid-queue-drug")] ICollector<DrugEntry> drugEntryCollector, [Queue("covidcureid-queue-regimen")] ICollector<RegimenEntry> regimenEntryCollector, ILogger log) { // Transform and push the data into queues for buffering. } |
We have three bindings for this Function:
- The blob trigger binding, which ingests files from the Azure Storage endpoint
- A queue binding to push drug entries we want to store
- A queue binding to push regimen entries we want to store
A regimen represents the set of drugs used in the treatment of a patient. For the user interface, we want to report by individual drug and then allow a drill-down into the regimens that drug was used in, so we break the drugs out of the regimens and create discrete entries for each occurrence of a drug in the case files.
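To make the entries easier to picture, here is a rough sketch of what the two entry types might look like, inferred from the fields referenced in the queries later in this post; the actual classes in the GitHub repository may differ:

```csharp
using System.Collections.Generic;

// Hypothetical shapes for the queue/CosmosDB entries; field names are inferred
// from the queries shown later and may not match the repository exactly.
public class DrugEntry
{
    public string EntryType { get; set; } = "Drug";
    public int DrugId { get; set; }
    public string DrugName { get; set; }
    public string Gender { get; set; }
    public int AgeLowerBound { get; set; }
    public int AgeUpperBound { get; set; }

    // Outcome counters that are summed when aggregating by drug.
    public int Improved { get; set; }
    public int Deteriorated { get; set; }
    public int Undetermined { get; set; }
}

public class RegimenEntry
{
    public string EntryType { get; set; } = "Regimen";
    public int RegimenId { get; set; }
    public string RegimenName { get; set; }
    public string CountryTreated { get; set; }
    public string Gender { get; set; }
    public int AgeLowerBound { get; set; }
    public int AgeUpperBound { get; set; }
    public string OutcomeComputed { get; set; }

    // The individual drugs that make up the regimen.
    public List<RegimenDrug> RegimenDrugs { get; set; }
}

public class RegimenDrug
{
    public int CureId { get; set; }
    public string Name { get; set; }
}
```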
To throttle the ingest, we can set the batch size for the queues in host.json, reducing the throughput to CosmosDB so that it stays below the 400 RU/s free tier limit:
{ "version": "2.0", "logging": { "applicationInsights": { "samplingExcludedTypes": "Request", "samplingSettings": { "isEnabled": true } } }, "extensions": { "queues": { "batchSize": 1, "maxDequeueCount": 3 } } } |
For the drug entries, we can directly pass the entry to CosmosDB:
[FunctionName("ProcessDrugEntry")] public void ProcessDrugEntry( [QueueTrigger("covidcureid-queue-drug")] DrugEntry entry, [CosmosDB(databaseName: "CovidCureId", collectionName: "CaseFiles", ConnectionStringSetting = "CovidCureIdCosmosDb")] out dynamic document, ILogger log) { document = entry; } |
But the regimen entries are duplicated, with one entry created for each drug in the regimen. For example, if a regimen contains Aspirin and Hydroxychloroquine, two entries are created for the same case file (the same regimen). So in this case, we need to ensure we do not store duplicates by checking whether we already have an instance:
[FunctionName("ProcessRegimenEntry")] public void ProcessRegimenEntry( [QueueTrigger("covidcureid-queue-regimen")] RegimenEntry entry, [CosmosDB(databaseName: "CovidCureId", collectionName: "CaseFiles", ConnectionStringSetting = "CovidCureIdCosmosDb")] out dynamic document, ILogger log) { // Check to see if there is already an entry for this regimen and do not duplicate the regimen entry Task<RegimenEntry> lookupTask = _regimens.Find(r => r.RegimenId == entry.RegimenId); lookupTask.Wait(); RegimenEntry existingEntry = lookupTask.Result; if(existingEntry == null) { // Only set the entry if we don't already have an entry for this regimen. document = entry; } else { document = null; } } |
This implementation is not foolproof since, unlike SQL, we don't have transactional control over the data store (CosmosDB is only transactional at the document level), but it's good enough for the purposes of this analysis. To truly guarantee a single entry, the application would have to use Azure Service Bus queues with a session ID to ensure once-only entry of the data.
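An alternative worth mentioning (a sketch, not what the project does) is to make the write idempotent by deriving the document id from the regimen id. Because the Functions CosmosDB output binding upserts documents, duplicate queue messages for the same regimen would overwrite the same document rather than create new ones:

```csharp
[FunctionName("ProcessRegimenEntryIdempotent")]
public void ProcessRegimenEntryIdempotent(
    [QueueTrigger("covidcureid-queue-regimen")] RegimenEntry entry,
    [CosmosDB(databaseName: "CovidCureId", collectionName: "CaseFiles",
        ConnectionStringSetting = "CovidCureIdCosmosDb")] out dynamic document,
    ILogger log)
{
    // Hypothetical variation: a deterministic id means repeated messages for
    // the same regimen upsert the same document rather than adding duplicates.
    document = new
    {
        id = $"regimen-{entry.RegimenId}",
        entry.EntryType,
        entry.RegimenId,
        entry.RegimenName,
        entry.Gender,
        entry.AgeLowerBound,
        entry.AgeUpperBound,
        entry.OutcomeComputed,
        entry.RegimenDrugs
    };
}
```

The trade-off is that later messages silently overwrite earlier ones, which is acceptable here because the duplicated regimen entries carry the same data.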
Application Front End
While somewhat heavy for such a simple application, the front end is built using:
- Vue.js
- Quasar Framework (using TypeScript and the Vue Composition API)
- Apex Charts
The combination is highly productive and allows a rapid build-out of the front-end with minimal fuss.
Though, in retrospect, a lighter framework like Alpine + Tailwind may have been an even better choice to reduce the overall network and browser load.
(See the repository for the code)
API Backend
For the API, we once again turn to Functions to build a low-cost, serverless solution.
There are two simple REST endpoints that we need.
First, an endpoint to query by drugs:
[FunctionName("QueryByDrugs")] public async Task<IActionResult> QueryByDrugs( [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "query/drug/by/{age}/{gender}")] HttpRequest req, int age, string gender, ILogger log) { // Find all of the drugs entries that match the age range and gender. QueryDefinition query = new QueryDefinition(@" SELECT c.DrugName, c.DrugId, SUM(c.Improved) AS Improved, SUM(c.Deteriorated) AS Deteriorated, SUM(c.Undetermined) AS Undetermined FROM c WHERE c.EntryType = 'Drug' AND @age >= c.AgeLowerBound AND @age <= c.AgeUpperBound AND LOWER(c.Gender) = @gender GROUP BY c.DrugName, c.DrugId") .WithParameter("@age", age) .WithParameter("@gender", gender.ToLowerInvariant()); List<AggregateResult> results = await _drugs.Query<AggregateResult, DrugEntry>(query); return new OkObjectResult(results); } |
When retrieving the drug information, we aggregate by outcome so that the front-end can use the totals to build the chart.
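The aggregate rows map onto a simple result type; a plausible sketch (the actual AggregateResult class in the repository may differ) looks like this:

```csharp
// Hypothetical shape of a row returned by the QueryByDrugs aggregation:
// one row per drug with its summed outcome counts.
public class AggregateResult
{
    public string DrugName { get; set; }
    public int DrugId { get; set; }
    public int Improved { get; set; }
    public int Deteriorated { get; set; }
    public int Undetermined { get; set; }
}
```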
Then an endpoint to query regimens using the drug:
[FunctionName("QueryByRegimen")] public async Task<IActionResult> QueryByRegimen( [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "query/regimen/by/{age}/{gender}/{drugId}")] HttpRequest req, int age, string gender, int drugId, ILogger log) { // Find all of the drugs entries that match the age range and gender. QueryDefinition query = new QueryDefinition(@" SELECT c.Id, c.RegimenName, c.CountryTreated, c.RegimenId, c.OutcomeComputed, c.Unusual, c.AdditionalInfo, c.AdverseEvents FROM CaseFiles c JOIN r IN c.RegimenDrugs WHERE @age >= c.AgeLowerBound AND @age <= c.AgeUpperBound AND LOWER(c.Gender) = @gender AND r.CureId = @drugId") .WithParameter("@age", age) .WithParameter("@gender", gender.ToLowerInvariant()) .WithParameter("@drugId", drugId); List<RegimenResult> results = await _drugs.Query<RegimenResult, DrugEntry>(query); return new OkObjectResult(results); } |
Note the JOIN operation. In CosmosDB, this is an intra-document operation: the document is joined with its own nested array, which allows reshaping the JSON.
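To illustrate with a made-up, abbreviated case file document (all field values here are hypothetical), the JOIN unrolls the nested RegimenDrugs array so that the WHERE clause can filter on each drug's CureId while still projecting the parent document's fields:

```json
{
  "Id": "4411-aspirin-hcq",
  "RegimenId": 4411,
  "RegimenName": "Aspirin / Hydroxychloroquine",
  "Gender": "Female",
  "AgeLowerBound": 50,
  "AgeUpperBound": 59,
  "OutcomeComputed": "Improved",
  "RegimenDrugs": [
    { "CureId": 101, "Name": "Aspirin" },
    { "CureId": 102, "Name": "Hydroxychloroquine" }
  ]
}
```

Querying with @drugId = 101 matches this document through the first array element and returns the top-level regimen fields; no second container is involved.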
Deploying to Azure
There are three parts of the application that we need to deploy:
- The static front-end assets for the website
- The API layer for the website
- The backend components for the website (Cosmos, Storage)
Azure Static Web Apps could condense the first two bullet points, but it is unfortunately limited to HTTP bindings only. Rather than creating the HTTP bindings in the Static Web App and the data processing in a separate Functions app, we simply deploy two separate endpoints: the static front-end and a single Functions app holding everything else.
I admit: I cheated here and used the UI to configure the services, as the CLI commands for Static Web Apps are not well documented for cases where you're not hooking up a GitHub URL.
Once the pieces are set up, we need to configure GitHub Actions workflows to build and push the output to Azure.
First is the build and deploy process for the Static Web App:
```yaml
# This workflow builds and deploys the front-end to Azure Static Web Apps
name: Build and Deploy Web

# Controls when the action will run.
on:
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      # Setup node
      - name: Setup Node
        uses: actions/setup-node@v1
        with:
          node-version: '14.x'

      - name: Cache Node.js modules
        uses: actions/cache@v2
        with:
          # npm cache files are stored in `~/.npm` on Linux/macOS
          path: ~/.npm
          key: ${{ runner.OS }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.OS }}-node-
            ${{ runner.OS }}-

      - name: Install dependencies
        working-directory: ./web
        run: npm install

      - name: Install Quasar CLI
        working-directory: ./web
        run: npm install -g @quasar/cli

      - name: Build for production with Quasar CLI
        working-directory: ./web
        run: API_ENDPOINT=${{ secrets.API_ENDPOINT }} GA_TOKEN=${{ secrets.GA_TOKEN }} quasar build

      # Prepare for deployment of the output to Azure
      # See: https://docs.microsoft.com/en-us/azure/static-web-apps/github-actions-workflow#build-and-deploy
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_CREDENTIALS }}
          repo_token: ${{ secrets.GITHUB_TOKEN }} # Used for GitHub integrations (i.e. PR comments)
          action: 'upload'
          ###### Repository/Build Configurations - These values can be configured to match your app requirements. ######
          app_location: '/web/dist/spa' # App source code path
          skip_app_build: true
          ###### End of Repository/Build Configurations ######
```
Note the steps for building the Quasar app and deploying the output to Azure. The "Build for production with Quasar CLI" step injects both the URL of the API and the Google Analytics token into the build.
For Quasar, the correct place to "receive" these settings is in the quasar.conf.js file under build.env:
```js
build: {
  // Other bits here...
  env: {
    API_ENDPOINT: process.env.API_ENDPOINT,
    GA_TOKEN: process.env.GA_TOKEN
  }
}
```
The Google Analytics token can then be injected into the HTML template without any additional webpack plugins:
```html
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=<%= process.env.GA_TOKEN %>"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', '<%= process.env.GA_TOKEN %>');
</script>
```
Next is the build and deploy process for Functions:
```yaml
# Deploys the API portion of the application to Azure.
# See: https://github.com/marketplace/actions/azure-functions-action
name: Build and Deploy API

on:
  workflow_dispatch:

# CONFIGURATION
env:
  AZURE_FUNCTIONAPP_NAME: CovidCureIdApp                     # set this to your application's name
  AZURE_FUNCTIONAPP_PACKAGE_PATH: './server/CovidCureIdApp'  # set this to the path to your web app project, defaults to the repository root
  DOTNET_VERSION: '3.1.409'                                  # set this to the dotnet version to use

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - name: 'Checkout GitHub Action'
        uses: actions/checkout@master

      - name: Setup DotNet ${{ env.DOTNET_VERSION }} Environment
        uses: actions/setup-dotnet@v1
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}

      - name: 'Resolve Project Dependencies Using Dotnet'
        shell: bash
        run: |
          pushd './${{ env.AZURE_FUNCTIONAPP_PACKAGE_PATH }}'
          dotnet build --configuration Release --output ./output
          popd

      - name: 'Run Azure Functions Action'
        uses: Azure/functions-action@v1
        id: fa
        with:
          app-name: ${{ env.AZURE_FUNCTIONAPP_NAME }}
          package: '${{ env.AZURE_FUNCTIONAPP_PACKAGE_PATH }}/output'
          publish-profile: ${{ secrets.AZURE_FUNC_PUBLISH_PROFILE }} # Obtain from Azure portal and paste into GH Secrets
```
The workflows rely on a number of secrets configured in GitHub to keep sensitive values out of source control:
- API_ENDPOINT is the endpoint of the API that we want to inject during the build process
- AZURE_CREDENTIALS is the secret key used to access the Static Web App for publishing
- AZURE_FUNC_PUBLISH_PROFILE is the XML-formatted publishing profile exported from Azure Functions to allow publishing to the Functions app
- GA_TOKEN is an optional Google Analytics token to inject into the front-end.
Why Azure?
Part of this exercise was to explore working with Azure Functions using only VS Code and .NET 5. The other part was productivity; I've touched on this before: I personally find Azure to be more productive than AWS, especially when it comes to linking I/O between different parts of your PaaS.
I also found the process of setting up the custom domain and SSL certificate for my custom domain much, much easier on Azure Static Web Apps than on AWS with S3 Static Web Sites.