Public datasets

Datagrok hosts a number of public datasets that are ready to use for testing and prototyping:

| Dataset | Description |
| --- | --- |
| Bioactive molecules with drug-like properties - ChEMBL data | A dataset from EBI's manually curated chemical database of bioactive molecules with drug-like properties. |
| Clinical trials - AACT data | The AACT (Aggregate Analysis of ClinicalTrials.gov) dataset is a publicly available, comprehensive resource that contains information on every study registered in ClinicalTrials.gov, including protocol and result data elements for each clinical trial. |
| Toxic chemical data - ToxCast data | EPA's most up-to-date, publicly available high-throughput toxicity data on thousands of chemicals, generated through the EPA's ToxCast research effort. The dataset includes qualitative results of over 600 experiments on about 8,000 compounds. |
| Toxic chemical data - Tox21 Data Challenge 2014 | A dataset created as part of an initiative to build a public database measuring the toxicity of compounds, and used in the 2014 Tox21 Data Challenge. It provides assay activity data and chemical structures for the Tox21 collection of ~10,000 compounds (Tox21 10K). |
| Drug side effects - EMBL's SIDER | The SIDER (Side Effect Resource) dataset is a comprehensive database of marketed drugs and their associated adverse drug reactions (ADRs). The SIDER dataset in DeepChem groups drug side effects into 27 system organ classes following the MedDRA (Medical Dictionary for Regulatory Activities) classification. The dataset covers 1,427 approved drugs and contains information on their chemical structures, associated ADRs, and the frequency of these side effects. |
| Bioassays: small molecules - PCBA | A subset of the PubChem BioAssay (PCBA) dataset containing biological activities of small molecules generated by high-throughput screening. The selection consists of 128 assays covering more than 400,000 compounds. |
| MUV data | A benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90,000 compounds and is specifically designed for validating virtual screening techniques. |
| Lipophilicity - ChEMBL data | A curated dataset from the ChEMBL database with experimental results on the octanol/water distribution coefficient (logD at pH 7.4). |
| AIDS antiviral screen data - NCI DTP data | A dataset introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability of over 40,000 compounds to inhibit HIV replication. |

Synthetic datasets

In addition to public datasets, you can use the following synthetic datasets:

| Table name | Description |
| --- | --- |
| Demog | Synthetic demographics data |
| Biosensor | Simulates biosensor signal (3-axis accelerometer, temperature, and EDA) |
| Plates | Experimental plate data: barcode, row, col, volume |
| Random walk | N columns; each row value differs from the previous one by a small delta |
| Molecules | Chemical compounds in SMILES format and their lipophilicity properties |
| Geo | Information about specific data points in relation to their geographical coordinates |
| Stock prices | Information about stock prices (company ticker, date, price) |
| Dose-response | Information on effects of various compounds on cell viability |
| Cars | Information about cars |
| Customers | Customer ID and name |
| Orders | Information about customer orders (customer id, item, quantity, price) |
| Products | Information about products (product, id, category, price) |

To access these datasets, follow these steps:

  1. On the Sidebar, click the Hamburger icon > Tools > Dev > Open test dataset. An Open test dataset dialog opens.
  2. In the dialog, set the desired number of rows and columns, and select the demo table.
  3. Click OK to open the generated test dataset in Datagrok.
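
To give a sense of what a generated table such as Random walk contains, here is a minimal Python sketch of the idea described in the table above: a chosen number of rows and columns, where each value differs from the previous one in its column by a small delta. This is an illustration only, not Datagrok's actual generator. The function name `random_walk_table`, the `max_delta` and `seed` parameters, and the `#1`, `#2`, ... column names are all hypothetical.

```python
import random


def random_walk_table(rows, cols, max_delta=1.0, seed=None):
    """Return a dict of `cols` columns, each a list of `rows` floats.

    Every value differs from the previous value in the same column by a
    random step drawn uniformly from [-max_delta, max_delta].
    Hypothetical helper for illustration; not part of any Datagrok API.
    """
    rng = random.Random(seed)
    table = {}
    for c in range(cols):
        value = 0.0  # each column starts at zero (an arbitrary choice)
        column = []
        for _ in range(rows):
            column.append(value)
            value += rng.uniform(-max_delta, max_delta)
        table[f"#{c + 1}"] = column  # hypothetical column naming
    return table


# Example: 5 rows, 3 columns, steps of at most 0.5 in either direction.
table = random_walk_table(rows=5, cols=3, max_delta=0.5, seed=42)
for name, values in table.items():
    print(name, [round(v, 3) for v in values])
```

The `rows` and `cols` arguments play the same role as the row and column counts you set in the dialog. Fixing `seed` makes the sketch reproducible, which is convenient when a test dataset needs to be identical between runs.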

Tip: You can connect to public providers, such as OpenWeatherMap, Alpha Vantage, and commerce.gov, by importing their Swagger files. To learn more about connecting to web services, see OpenAPI.