Public datasets

Datagrok hosts a number of public datasets that are ready to use for testing and prototyping:

Dataset	Description
Bioactive molecules with drug-like properties - ChEMBL data	A dataset from EBI's manually curated chemical database of bioactive molecules with drug-like properties
Clinical trials - AACT data	The AACT (Aggregate Analysis of ClinicalTrials.gov) dataset is a publicly available, comprehensive resource that contains information on every study registered in ClinicalTrials.gov, including protocol and result data elements for each clinical trial
Toxic chemical data - ToxCast data	EPA's most up-to-date, publicly available high-throughput toxicity data on thousands of chemicals. This data is generated through the EPA's ToxCast research effort. The dataset includes qualitative results of over 600 experiments on 8,000 compounds
Toxic chemical data - Tox21 Data Challenge 2014	A dataset produced by an initiative to build a public database of compound toxicity measurements, and used in the 2014 Tox21 Data Challenge. It provides assay activity data and chemical structures for the Tox21 collection of ~10,000 compounds (Tox21 10K)
Drug side effects - EMBL's SIDER	SIDER (Side Effect Resource) is a comprehensive database of marketed drugs and their associated adverse drug reactions (ADRs). The SIDER dataset in DeepChem groups drug side effects into 27 system organ classes following the MedDRA (Medical Dictionary for Regulatory Activities) classifications. The dataset covers 1,427 approved drugs and contains information on their chemical structures, associated ADRs, and the frequency of these side effects
Bioassays: small molecules - PCBA	A subset of the PubChem BioAssay (PCBA) dataset containing biological activities of small molecules generated by high-throughput screening. The selection consists of 128 assays measured over 400,000 compounds
MUV data	A benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90,000 compounds and is specifically designed for validation of virtual screening techniques
Lipophilicity - ChEMBL data	A curated dataset from the ChEMBL database with experimental results on the octanol/water distribution coefficient (logD at pH 7.4)
AIDS antiviral screen data - NCI DTP data	A dataset with AIDS antiviral screen data, introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds

Synthetic datasets

In addition to public datasets, you can use the following synthetic datasets:

Table name	Description
Demog	Synthetic demographics data
Biosensor	Simulates biosensor signal (3-axis accelerometer, temperature, and EDA)
Plates	Experimental plate data: barcode, row, col, volume
Random walk	N columns; each row value differs from the previous one by a small delta
Molecules	Chemical compounds in SMILES format and their lipophilicity properties
Geo	Information about specific data points in relation to their geographical coordinates
Stock prices	Information about stock prices (company ticker, date, price)
Dose-response	Information on effects of various compounds on cell viability
Cars	Information about cars
Customers	Customer ID and name
Orders	Information about customer orders (customer id, item, quantity, price)
Products	Information about products (product, id, category, price)

To access these datasets, follow these steps:

On the Sidebar, click the Hamburger icon > Tools > Dev > Open test dataset. An Open test dataset dialog opens.
In the dialog, set the desired number of rows and columns, and select the demo table.
Click OK to open the generated test dataset in Datagrok.
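
To get a feel for what the row and column settings in the dialog produce, here is a minimal Python sketch of a table shaped like the Random walk dataset described above: N columns, where each value differs from the previous one by a small delta. This is an illustration only, not Datagrok's generator. The function name `random_walk_table`, the column names, the step size `max_delta`, and the starting value of 0 are all assumptions.

```python
import random


def random_walk_table(rows, cols, max_delta=0.1, seed=None):
    """Build a random-walk table as a dict of column name -> list of floats.

    Hypothetical helper for illustration; it does not reproduce
    Datagrok's actual generator or its column naming.
    """
    rng = random.Random(seed)  # local generator, so a seed gives repeatable output
    table = {}
    for c in range(cols):
        value = 0.0  # assumed starting point for every column
        column = []
        for _ in range(rows):
            column.append(value)
            # Each next value moves by at most max_delta in either direction.
            value += rng.uniform(-max_delta, max_delta)
        table[f"col_{c + 1}"] = column  # assumed column naming
    return table


if __name__ == "__main__":
    demo = random_walk_table(rows=5, cols=3, seed=42)
    for name, values in demo.items():
        print(name, [round(v, 3) for v in values])
```

Increasing `rows` lengthens each walk, and increasing `cols` adds independent walks, which mirrors the two settings you choose in the dialog.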

tip

You can connect to public providers, such as OpenWeatherMap, Alpha Vantage, and commerce.gov, by importing their Swagger files. To learn more about connecting to web services, see OpenAPI.
