Chemical scripts

Supported scripts

Name	Function
Substructure search	`\#{x.ChemSubstructureSearch}`
Find MCS	`\#{x.ChemFindMCS}`
Descriptors	`\#{x.ChemDescriptors}`
R-Groups	`\#{x.ChemGetRGroups}`
Fingerprints	`\#{x.ChemFingerprints}`
Similarity SPE	`\#{x.ChemSimilaritySPE}`
SMILES to InChI	`\#{x.ChemSmilesToInchi}`
SMILES to Canonical	`\#{x.ChemSmilesToCanonical}`
Chemical map identifiers	`\#{x.ChemMapIdentifiers}`
Butina cluster	`\#{x.ChemScripts:ButinaMoleculesClustering}`
Filter by catalogs	`\#{x.ChemScripts:FilterByCatalogs}`
Gasteiger partial charges	`\#{x.ChemScripts:GasteigerPartialCharges}`
Murcko scaffolds	`\#{x.ChemScripts:MurckoScaffolds}`
Similarity maps using fingerprints	`\#{x.ChemScripts:SimilarityMapsUsingFingerprints}`
Chemical space using tSNE	`\#{x.ChemScripts:ChemicalSpaceUsingtSNE}`
Two component reactions	`Chem:TwoComponentReaction`
Chemical space using UMAP	`\#{x.ChemScripts:ChemicalSpaceUsingUMAP}`
USRCAT	`\#{x.ChemScripts:USRCAT}`
Mutate	`[PLACEHOLDER]`
Solubility prediction	`\#{x.18b704d0-0b50-11e9-b846-1fa94a4da5d1."Predict Solubility"}`
Curate	`[PLACEHOLDER]`

The following table gives indicative performance data for certain chemical functions:

Indicative performance of chemical functions

Function	Molecules	Execution time, s
ChemSubstructureSearch	1M	70
ChemFindMCS	100k	43
ChemDescriptors (201 descriptors)	1k	81
ChemDescriptors (Lipinski)	1M	164
ChemGetRGroups	1M	233
ChemFingerprints (TopologicalTorsion)	1M	782
ChemFingerprints (MACCSKeys)	1M	770
ChemFingerprints (Morgan/Circular)	1M	737
ChemFingerprints (RDKFingerprint)	1M	2421
ChemFingerprints (AtomPair)	1M	1574
ChemSmilesToInChI	1M	946
ChemSmilesToInChIKey	1M	389
ChemSmilesToCanonical	1M	331

Butina cluster

Clusters molecules with the Butina algorithm. The only input to the clustering program is the desired similarity within a cluster, measured by the Tanimoto index.
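
As an illustration of how a similarity cutoff drives the clustering, here is a minimal sketch built on RDKit's `Butina.ClusterData`. The function name `butina_clusters`, the choice of `RDKFingerprint`, and the default cutoff are assumptions made for this example; the script itself may use a different fingerprint.

```python
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina


def butina_clusters(smiles_list, similarity_cutoff=0.7):
    """Return clusters as tuples of molecule indices (illustrative helper)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    # Assumption: RDKit topological fingerprints; any bit-vector fingerprint works here.
    fps = [Chem.RDKFingerprint(m) for m in mols]

    # ClusterData expects the lower triangle of the distance matrix as a flat list.
    distances = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        distances.extend(1.0 - s for s in sims)

    # The distance threshold is the complement of the desired Tanimoto similarity.
    return Butina.ClusterData(
        distances, len(fps), 1.0 - similarity_cutoff, isDistData=True
    )


clusters = butina_clusters(["CCO", "CCO", "c1ccccc1"])
```

Each returned tuple lists the indices of one cluster, with the cluster centroid first.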

References:

Chemical space using tSNE

tSNE, short for t-distributed Stochastic Neighbor Embedding, is a data visualization tool designed to handle high-dimensional data. It achieves this by transforming the similarities between data points into joint probabilities, then minimizing the Kullback-Leibler divergence between the low-dimensional embedding and the original high-dimensional data. tSNE uses a non-convex cost function, meaning that different initializations can lead to different results. The following image illustrates the use of tSNE to visualize chemical space.

Chemical Space Using tSNE
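
A comparable projection can be sketched with RDKit fingerprints and scikit-learn's `TSNE`. This is only an approximation of the idea under stated assumptions: the helper name `embed_chemical_space`, the fingerprint type, the Euclidean metric, and the perplexity value are all chosen for illustration and are not taken from the script.

```python
import numpy as np
from rdkit import Chem, DataStructs
from sklearn.manifold import TSNE


def embed_chemical_space(smiles_list, seed=0, perplexity=3.0):
    """Project molecules to 2D coordinates with tSNE (illustrative helper)."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = Chem.RDKFingerprint(mol)  # assumption: topological fingerprint
        arr = np.zeros((0,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to the bit count
        rows.append(arr)
    features = np.array(rows, dtype=float)

    # The cost function is non-convex, so the random seed affects the layout.
    # perplexity must be smaller than the number of molecules.
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(features)


smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "Cc1ccccc1",
          "CC(=O)O", "CCCC", "c1ccncc1"]
coords = embed_chemical_space(smiles, seed=0)
```

Running the function with a different `seed` will generally produce a different layout, which is the initialization dependence described above.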

References:

RDKit
tSNE

Chemical space using UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualization similarly to tSNE, but also for general non-linear dimensionality reduction.

Chemical Space Using UMAP

References:

Filter by catalogs

Screen out or reject undesirable molecules based on various criteria.

Filter sets:

PAINS: Pan assay interference patterns, separated into three sets (PAINS_A, PAINS_B, and PAINS_C).
BRENK: Filters unwanted functionality due to potential toxicity reasons or unfavorable pharmacokinetics.
NIH: Annotated compounds with problematic functional groups.
ZINC: Filtering based on drug-likeness and unwanted functional groups.
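
The filter sets above correspond to catalogs in RDKit's `FilterCatalog` module. The following sketch shows one way to list the alerts a molecule triggers; the helper name `catalog_alerts` and its string-based catalog selection are assumptions for this example, not the script's interface.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams


def catalog_alerts(smiles, catalog_name="PAINS"):
    """Return descriptions of the catalog entries matched by a molecule."""
    params = FilterCatalogParams()
    # catalog_name is looked up on the RDKit enum, e.g. PAINS, PAINS_A, BRENK, NIH, ZINC.
    params.AddCatalog(getattr(FilterCatalogParams.FilterCatalogs, catalog_name))
    catalog = FilterCatalog(params)

    mol = Chem.MolFromSmiles(smiles)
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]


alerts = catalog_alerts("CCO", "PAINS")
```

An empty list means the molecule passes the selected catalog; a non-empty list names the patterns that would cause it to be rejected.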

References:

RDKit FilterCatalogs

Gasteiger partial charges

Visualizes atomic charges in a molecule.

Gasteiger Partial Charges
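
The underlying charges can be obtained with RDKit's `ComputeGasteigerCharges`, which stores one value per atom in the `_GasteigerCharge` property. The sketch below covers only the numeric part, not the rendering; the helper name `gasteiger_charges` is an assumption for this example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def gasteiger_charges(smiles):
    """Return (element symbol, Gasteiger charge) for every atom, hydrogens included."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.ComputeGasteigerCharges(mol)
    return [
        (atom.GetSymbol(), atom.GetDoubleProp("_GasteigerCharge"))
        for atom in mol.GetAtoms()
    ]


charges = gasteiger_charges("CCO")
```

Heavy atoms keep their order from the input SMILES, and the added hydrogens follow them.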

References:

Murcko scaffolds

Converts a column with molecules to Murcko scaffolds.

Murcko Scaffolds
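
For reference, the same conversion can be sketched with RDKit's `MurckoScaffold` module. The helper name `murcko_scaffolds` is an assumption for this example.

```python
from rdkit.Chem.Scaffolds import MurckoScaffold


def murcko_scaffolds(smiles_list):
    """Return the Murcko scaffold SMILES for each input SMILES."""
    return [MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list]


scaffolds = murcko_scaffolds(["Cc1ccccc1"])
```

Side chains are stripped and only ring systems and the linkers between them are kept, so toluene reduces to benzene.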

References:

Mutate

Mutate molecules using different mechanisms:

Adding atoms
Adding bonds
Removing bonds

Mutations can be randomized with the randomize flag. When it is set, the mutation mechanism and the position in the molecule are chosen at random at each mutation step.

References:

RDKit

Reactions

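The reaction template is specified in SMARTS format. Depending on the matrixExpansion flag, reactants are either combined across the two sets (every reactant from the first set with every reactant from the second) or paired sequentially.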

Reactions
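
The two combination modes can be sketched with RDKit's reaction API. The helper name `run_two_component_reaction`, its `matrix_expansion` argument, and the reading of the flag (all combinations versus pairing by position) are assumptions for this example; the amide-coupling template comes from the RDKit documentation.

```python
from itertools import product

from rdkit import Chem
from rdkit.Chem import AllChem


def run_two_component_reaction(smarts, reactants1, reactants2, matrix_expansion=False):
    """Return one sorted list of product SMILES per reactant pair."""
    rxn = AllChem.ReactionFromSmarts(smarts)
    set1 = [Chem.MolFromSmiles(s) for s in reactants1]
    set2 = [Chem.MolFromSmiles(s) for s in reactants2]

    # Assumed semantics: matrix expansion tries every combination of the two sets,
    # otherwise reactants are paired by position.
    pairs = product(set1, set2) if matrix_expansion else zip(set1, set2)

    results = []
    for a, b in pairs:
        smiles = set()
        for products in rxn.RunReactants((a, b)):
            for p in products:
                Chem.SanitizeMol(p)  # products are returned unsanitized
                smiles.add(Chem.MolToSmiles(p))
        results.append(sorted(smiles))
    return results


AMIDE_COUPLING = "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"
paired = run_two_component_reaction(AMIDE_COUPLING, ["CC(=O)O"], ["NC"])
```

With two acids and two amines, `matrix_expansion=True` yields four result lists, while the sequential mode yields two.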

References:

Similarity maps using fingerprints

Visualizes the atomic contributions to the similarity between a molecule and a reference molecule.

Similarity Maps Using Fingerprints

References:

Solubility prediction

The model was trained with the H2O modeling engine on the "Solubility Train" dataset
(#{x.Demo:SolubilityTrain."Solubility Train"}) using the "Generalized Linear Modeling" method.

Molecular descriptors used in the model:

MolWt: Molecular weight
Ipc: The information content of the coefficients of the characteristic polynomial of the adjacency matrix of a hydrogen-suppressed graph of a molecule
TPSA: Total polar surface area
LabuteASA: Labute's approximate surface area
NumHDonors: Number of hydrogen donors
NumHAcceptors: Number of hydrogen acceptors
MolLogP: Wildman-Crippen LogP value
HeavyAtomCount: Number of heavy atoms
NumRotatableBonds: Number of rotatable bonds
RingCount: Number of rings
NumValenceElectrons: Number of valence electrons
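
The descriptor names above match functions in RDKit's `Descriptors` module, so the model inputs for a molecule can be reproduced roughly as follows. The helper name `solubility_descriptors` is an assumption for this example, and the sketch covers only the descriptor calculation, not the H2O model itself.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Descriptor names as listed for the solubility model.
MODEL_DESCRIPTORS = [
    "MolWt", "Ipc", "TPSA", "LabuteASA", "NumHDonors", "NumHAcceptors",
    "MolLogP", "HeavyAtomCount", "NumRotatableBonds", "RingCount",
    "NumValenceElectrons",
]


def solubility_descriptors(smiles):
    """Return a dict mapping each descriptor name to its value for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return {name: getattr(Descriptors, name)(mol) for name in MODEL_DESCRIPTORS}


values = solubility_descriptors("CCO")
```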

References:

USRCAT

USRCAT is an extension of the Ultrafast Shape Recognition (USR) algorithm, which is used for molecular shape-based virtual screening to discover new chemical scaffolds in compound libraries. USRCAT incorporates pharmacophoric information in addition to molecular shape, which enables it to distinguish between compounds with similar shapes but distinct pharmacophoric features.

USRCAT
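
A minimal sketch of a USRCAT comparison with RDKit follows. The helper names, the conformer embedding step, and the random seed are assumptions for this example; USRCAT needs 3D coordinates, so a single conformer is generated first.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors


def usrcat_descriptor(smiles, seed=42):
    """Return the USRCAT descriptor of one embedded conformer (illustrative helper)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    # EmbedMolecule returns a negative value when no conformer could be generated.
    if AllChem.EmbedMolecule(mol, randomSeed=seed) < 0:
        raise ValueError(f"could not embed {smiles}")
    return list(rdMolDescriptors.GetUSRCAT(mol))


def usrcat_similarity(desc_a, desc_b):
    """Score two descriptors; identical descriptors score 1.0."""
    return rdMolDescriptors.GetUSRScore(desc_a, desc_b)


aspirin = usrcat_descriptor("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = usrcat_descriptor("O=C(O)c1ccccc1O")
score = usrcat_similarity(aspirin, salicylic_acid)
```

Each descriptor holds 60 values: 12 shape moments for all atoms plus 12 for each of four pharmacophoric atom subsets.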

References:

RDKit
USRCAT
