The Best/Worst Kept “Secret” Data Repository in Life Sciences?

If you work in life sciences, at some point or another you’ve surely come across ClinicalTrials.gov.

According to Wikipedia, ClinicalTrials.gov has its origins in the 1980s and may have been influenced by the HIV epidemic:

As a result of pressure from HIV-infected men in the gay community, who demanded better access to clinical trials, the U.S. Congress passed the Health Omnibus Programs Extension Act of 1988 (Public Law 100-607) which mandated the development of a database of AIDS Clinical Trials Information Services (ACTIS). This effort served as an example of what might be done to improve public access to clinical trials, and motivated other disease-related interest groups to push for something similar for all diseases.

The Food and Drug Administration Modernization Act of 1997 (Public Law 105-115) amended the Food, Drug and Cosmetic Act and the Public Health Service Act to require that the NIH create and operate a public information resource, which came to be called ClinicalTrials.gov, tracking drug efficacy studies resulting from approved Investigational New Drug (IND) applications (FDA Regulations 21 CFR Parts 312 and 812). With the primary purpose of improving access of the public to clinical trials where individuals with serious diseases and conditions might find experimental treatments…

Today, a significant number of clinical trials — but not all — are registered in ClinicalTrials.gov, per the site’s background page:

“ClinicalTrials.gov does not contain information about all the clinical studies conducted in the United States because not all studies are required by law to be registered (for example, observational studies and trials that do not study a drug, biologic, or device). See FDAAA 801 and the Final Rule for more information. However, the rate of study registration has increased over time as more policies and laws requiring registration have been enacted and as more sponsors and investigators have voluntarily registered their studies.

The registration requirements were expanded after Congress passed the FDA Amendments Act of 2007 (FDAAA). Section 801 of FDAAA (FDAAA 801) requires more types of trials to be registered and additional trial registration information to be submitted. The law also requires the submission of results for certain trials.”

So, by law, it does not contain “all” trials. Even so, this repository is a treasure trove of information. The site also provides a very well documented set of APIs to retrieve the information in XML, JSON, or CSV using simple HTTP GET requests. If you want to try it yourself, the site even provides handy demo pages that show you how to structure requests and what a sample response looks like!
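As a rough sketch of what one of these GET requests might look like: the endpoint and parameter names below are assumptions based on the current v2 API, so check the official ClinicalTrials.gov API documentation for the exact interface.

```python
from urllib.parse import urlencode

# Hypothetical example: the endpoint and parameter names are assumptions;
# consult the ClinicalTrials.gov API docs for the current interface.
BASE = "https://clinicaltrials.gov/api/v2/studies"

params = {
    "query.cond": "melanoma",  # free-text condition search
    "format": "json",          # XML and CSV are also offered by the API
    "pageSize": 5,             # cap the number of studies returned
}
url = f"{BASE}?{urlencode(params)}"
print(url)

# Sending the request (requires network access):
# from urllib.request import urlopen
# import json
# with urlopen(url) as resp:
#     data = json.load(resp)
```

The request itself is left commented out so the snippet runs without network access; the point is simply that a plain HTTP GET with query parameters is all it takes.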

But what if you want to answer a different type of question? What if you want to aggregate and search across this data? What if you want to dive deeper rather than processing individual data records?

Enter the CTTI AACT Database

The CTTI AACT (Aggregate Analysis of ClinicalTrials.gov) database is “simply” a normalized PostgreSQL database that contains every single record in ClinicalTrials.gov:

“AACT is a publicly available relational database that contains all information (protocol and result data elements) about every study registered in ClinicalTrials.gov. Content is downloaded from ClinicalTrials.gov daily and loaded into AACT.”

This database is refreshed daily and can be downloaded or, best of all, queried ad hoc with open source clients! It comes with a full data dictionary, as well as a schema that describes all of the entity relationships.

Armed with this, we can now dig in and ask some interesting questions.

Connecting to the Database

The easiest way to connect to the database is to simply sign up and get an account on the AACT website (which also has detailed instructions for how to connect to the database).

Once you’ve created an account, you’ll need a client to access the database. We’re going to use a tool called pgweb, a simple, open-source, command-line tool with a web interface that provides access without having to install a bulky client.

Once you download the tool, you can run it from the command line by invoking the executable with your AACT username and password in the connection parameters.
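A minimal invocation might look like the following. This assumes pgweb is on your PATH; USER and PASSWORD are placeholders for your AACT credentials, and the host and database names reflect the AACT defaults at the time of writing, so double-check them against the AACT connection instructions.

```shell
# USER and PASSWORD are placeholders for your AACT credentials.
pgweb --host aact-db.ctti-clinicaltrials.org \
      --port 5432 \
      --user USER \
      --pass PASSWORD \
      --db aact
```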

This will start a small local web server that serves a simple web application that connects to the target server. Then you open a browser to the URL http://localhost:8081 and you have access to all of the data in the database going back over 20 years!

Querying the Database

To query the database, you’ll need to know basic SQL and reference the schema mentioned above to understand how to find the data you’re looking for. Fear not, though: the schema is relatively straightforward, with most records related by the nct_id field. In other words, if you know the NCT ID, you can pretty easily navigate this dataset even if you’re not a SQL expert.
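To make the nct_id relationship concrete, here’s a toy sketch using Python’s built-in sqlite3 as a stand-in for the real PostgreSQL database. The table and column names mirror the AACT dictionary (studies, facilities), but the rows are invented for illustration.

```python
import sqlite3

# Toy in-memory mock of two AACT tables (real AACT is PostgreSQL;
# the rows below are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies    (nct_id TEXT PRIMARY KEY, brief_title TEXT);
CREATE TABLE facilities (id INTEGER PRIMARY KEY, nct_id TEXT, name TEXT, city TEXT);

INSERT INTO studies VALUES ('NCT00000001', 'Example Oncology Trial');
INSERT INTO facilities (nct_id, name, city) VALUES
  ('NCT00000001', 'General Hospital', 'Boston'),
  ('NCT00000001', 'City Clinic', 'Denver');
""")

# Everything keys off nct_id: given an ID, join out to the related tables.
rows = conn.execute("""
    SELECT s.brief_title, f.name, f.city
    FROM studies s
    JOIN facilities f ON f.nct_id = s.nct_id
    WHERE s.nct_id = 'NCT00000001'
    ORDER BY f.name
""").fetchall()
for title, site, city in rows:
    print(f"{title}: {site} ({city})")
```

The same join pattern (studies joined to any other table on nct_id) carries over directly to the real database in pgweb.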

Let’s say we want to find the top sponsors by number of sites for trials submitted in 2019.
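A query along these lines might look like the following. To keep the example self-contained and runnable, it uses an in-memory SQLite mock with a handful of invented rows; the column names (studies.study_first_submitted_date, sponsors.lead_or_collaborator) follow the AACT data dictionary, and essentially the same SQL can be run against the real database in pgweb.

```python
import sqlite3

# In-memory stand-in for AACT (real AACT is PostgreSQL; rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies    (nct_id TEXT PRIMARY KEY, study_first_submitted_date TEXT);
CREATE TABLE sponsors   (nct_id TEXT, name TEXT, lead_or_collaborator TEXT);
CREATE TABLE facilities (id INTEGER PRIMARY KEY, nct_id TEXT, name TEXT);

INSERT INTO studies VALUES ('NCT1', '2019-03-01'), ('NCT2', '2019-07-15'), ('NCT3', '2018-05-02');
INSERT INTO sponsors VALUES
  ('NCT1', 'National Cancer Institute (NCI)', 'lead'),
  ('NCT2', 'National Cancer Institute (NCI)', 'lead'),
  ('NCT3', 'Acme Pharma', 'lead');
INSERT INTO facilities (nct_id, name) VALUES
  ('NCT1', 'Site A'), ('NCT1', 'Site B'), ('NCT2', 'Site C'), ('NCT3', 'Site D');
""")

# Top lead sponsors by site count, for studies first submitted in 2019.
top = conn.execute("""
    SELECT sp.name, COUNT(f.id) AS site_count
    FROM sponsors sp
    JOIN studies s    ON s.nct_id = sp.nct_id
    JOIN facilities f ON f.nct_id = s.nct_id
    WHERE sp.lead_or_collaborator = 'lead'
      AND s.study_first_submitted_date BETWEEN '2019-01-01' AND '2019-12-31'
    GROUP BY sp.name
    ORDER BY site_count DESC
    LIMIT 10
""").fetchall()
print(top)
```

The 2018 study drops out of the window, so only the 2019 submissions are counted.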

The NCI has by far the largest number of sites in the system (though keep in mind that this may also be an artifact of better data reporting!)

What if we’re curious about the effect of COVID on the total number of sites registered? We can count the number of sites for studies submitted in 2019 versus 2020.
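One way to phrase that comparison is a grouped count by submission year. Again, this is a self-contained SQLite sketch with invented rows; on the real PostgreSQL database you could use EXTRACT(YEAR FROM study_first_submitted_date) instead of substr().

```python
import sqlite3

# In-memory stand-in for AACT (rows are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies    (nct_id TEXT PRIMARY KEY, study_first_submitted_date TEXT);
CREATE TABLE facilities (id INTEGER PRIMARY KEY, nct_id TEXT);

INSERT INTO studies VALUES ('NCT1', '2019-04-01'), ('NCT2', '2020-04-01'), ('NCT3', '2020-09-09');
INSERT INTO facilities (nct_id) VALUES ('NCT1'), ('NCT1'), ('NCT1'), ('NCT2'), ('NCT3');
""")

# Site counts grouped by the year the study was first submitted.
by_year = conn.execute("""
    SELECT substr(s.study_first_submitted_date, 1, 4) AS yr,
           COUNT(f.id) AS site_count
    FROM studies s
    JOIN facilities f ON f.nct_id = s.nct_id
    WHERE substr(s.study_first_submitted_date, 1, 4) IN ('2019', '2020')
    GROUP BY yr
    ORDER BY yr
""").fetchall()
print(by_year)
```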



There are any number of questions that can be answered!

  • What sites have the most oncology trials?
  • Which therapeutic areas is each sponsor focused on?
  • How many sites are there in each state?
  • Which states or countries are seeing the most activity with trial and site registration?

Now keep in mind: this data is not necessarily representative. Even though it contains trials and sites from all over the world, not every trial or site will be listed here. However, I think it can still serve as a proxy for the broader industry.

What Can You Do With This?

Of course, you can mine this data for trends and analysis for a variety of purposes, whether you’re an academic putting together a research paper or a financial analyst trying to understand industry trends.

One powerful idea: given that there are 20 years of data available, it may be possible to build a predictive model of the probability of technical and regulatory success (PTRS) for a given IND or a specific trial design, based on the facets of the historical data as well as the known successful INDs. In other words, imagine a machine learning model that would allow a sponsor to calculate deltas in PTRS based on the therapy area, number of sites, location of sites, inclusion/exclusion criteria, and so on. Such a model could be used to determine the optimal protocol design and site selection criteria to minimize cost and maximize efficiency without putting PTRS at risk.

There are even a number of startups and companies that have likely built novel tools on the foundation of this database (possibly incorporating other non-public databases). Trial Scout, for example, likely uses it as the foundation of its trial listings.

FindMeCure is another interesting company I came across via Kunal Sampat’s ClinicalTrialPodcast.

There are a number of companies now vying to address the gaps in accessibility and usability of the ClinicalTrials.gov website, extending its mission by providing much more consumer-friendly interfaces and value-added services on top of it, with the goal of improving patient enrollment and helping patients find clinical trials as a potential option for care.

At Zytonomy, I’ve found the database useful for a much more basic purpose: simplifying the setup of workspaces for clinical trial sites by pulling in known information, as well as the existing protocol, when a trial is already registered in ClinicalTrials.gov.
