Chemical scripts
Supported scripts
Name | Function |
---|---|
Substructure search | |
Find MCS | |
Descriptors | |
R-Groups | |
Fingerprints | |
Similarity SPE | |
SMILES to InChI | |
SMILES to Canonical | |
Chemical map identifiers | |
Butina cluster | |
Filter by catalogs | |
Gasteiger partial charges | |
Murcko scaffolds | |
Similarity maps using fingerprints | |
Chemical space using tSNE | |
Two component reactions | |
Chemical space using UMAP | |
USRCAT | |
Mutate | |
Solubility prediction | |
Curate | |
The following table gives indicative performance data for selected chemical functions:
Indicative performance of chemical functions
Function | Molecules | Execution time, s |
---|---|---|
ChemSubstructureSearch | 1M | 70 |
ChemFindMcs | 100k | 43 |
ChemDescriptors (201 descriptors) | 1k | 81 |
ChemDescriptors (Lipinski) | 1M | 164 |
ChemGetRGroups | 1M | 233 |
ChemFingerprints (TopologicalTorsion) | 1M | 782 |
ChemFingerprints (MACCSKeys) | 1M | 770 |
ChemFingerprints (Morgan/Circular) | 1M | 737 |
ChemFingerprints (RDKFingerprint) | 1M | 2421 |
ChemFingerprints (AtomPair) | 1M | 1574 |
ChemSmilesToInChI | 1M | 946 |
ChemSmilesToInChIKey | 1M | 389 |
ChemSmilesToCanonical | 1M | 331 |
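Because the rows use different molecule counts (1k, 100k, 1M), the raw times are not directly comparable. The sketch below converts the table into throughput (molecules per second). It assumes that "1k", "100k", and "1M" mean 1,000, 100,000, and 1,000,000 molecules; the `throughput` helper and the shortened row labels are illustrative and not part of the platform.

```python
# (function, molecules, seconds) copied from the table above.
# Assumption: 1k = 1_000, 100k = 100_000, 1M = 1_000_000 molecules.
timings = [
    ("ChemSubstructureSearch", 1_000_000, 70),
    ("ChemFindMcs", 100_000, 43),
    ("ChemDescriptors (201)", 1_000, 81),
    ("ChemDescriptors (Lipinski)", 1_000_000, 164),
    ("ChemGetRGroups", 1_000_000, 233),
    ("ChemFingerprints (TopologicalTorsion)", 1_000_000, 782),
    ("ChemFingerprints (MACCSKeys)", 1_000_000, 770),
    ("ChemFingerprints (Morgan/Circular)", 1_000_000, 737),
    ("ChemFingerprints (RDKFingerprint)", 1_000_000, 2421),
    ("ChemFingerprints (AtomPair)", 1_000_000, 1574),
    ("ChemSmilesToInChI", 1_000_000, 946),
    ("ChemSmilesToInChIKey", 1_000_000, 389),
    ("ChemSmilesToCanonical", 1_000_000, 331),
]


def throughput(molecules, seconds):
    """Molecules processed per second (hypothetical helper)."""
    return molecules / seconds


rates = {name: throughput(m, s) for name, m, s in timings}

# Print the functions from fastest to slowest.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {rate:10.1f} molecules/s")
```

Read this way, the full 201-descriptor calculation is the slowest per molecule by a wide margin, while substructure search is the fastest.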
Butina cluster
Clusters molecules using a single input: the desired similarity within a cluster, defined by the Tanimoto index.
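A minimal pure-Python sketch of Butina-style clustering, assuming fingerprints are given as sets of "on" bit positions. The function names and the toy fingerprints are hypothetical and are not the script's actual code; the sketch only shows how a single Tanimoto cutoff drives the whole procedure.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto index of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fingerprints, similarity_cutoff):
    """Return clusters as lists of indices; the first index is the centroid."""
    n = len(fingerprints)

    # 1. Neighbor lists: all pairs at or above the similarity cutoff.
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= similarity_cutoff:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # 2. Visit molecules from most to fewest neighbors (ties: lower index first).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    # 3. Each still-unassigned molecule becomes a centroid and claims its
    #    still-unassigned neighbors; molecules without neighbors are singletons.
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical data).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
clusters = butina_clusters(fps, similarity_cutoff=0.7)
print(clusters)  # [[2, 0, 1], [3], [4], [5]]
```

Molecules 0 and 1 have a Tanimoto index of only 0.6 to each other, yet they share a cluster because each is within the cutoff of the centroid (molecule 2). The cutoff bounds similarity to the centroid, not between every pair of members.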
Chemical space using tSNE
tSNE, short for t-distributed Stochastic Neighbor Embedding, is a technique for visualizing high-dimensional data. It converts the similarities between data points into joint probabilities, then minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the original high-dimensional data. The tSNE cost function is non-convex, so different initializations can lead to different results. The following image illustrates the use of tSNE to visualize chemical space.
Chemical space using UMAP
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualization similarly to tSNE, but also for general non-linear dimensionality reduction.
Filter by catalogs
Screens out (rejects) undesirable molecules based on the selected filter sets.
Filter sets:
- PAINS: Pan assay interference patterns, separated into three sets (PAINS_A, PAINS_B, and PAINS_C).
- BRENK: Filters out unwanted functionality because of potential toxicity or unfavorable pharmacokinetics.
- NIH: Annotated compounds with problematic functional groups.
- ZINC: Filtering based on drug-likeness and unwanted functional groups.
Gasteiger partial charges
Visualizes atomic charges in a molecule.
Murcko scaffolds
Converts a column with molecules to Murcko scaffolds.
References:
- rdkit.Chem.Scaffolds.MurckoScaffold module
- Computational Exploration of Molecular Scaffolds in Medicinal Chemistry
- Comparative analyses of structural features and scaffold diversity for purchasable compound libraries
Mutate
Mutate molecules using different mechanisms:
- Adding atoms
- Adding bonds
- Removing bonds
Mutations can be randomized using the randomize flag. When it is set, the mutation mechanism and the position of the mutation are chosen at random for each mutation step.
Reactions
The reaction template is in SMARTS format. Depending on the matrixExpansion flag, reactants are either combined across the two sets or paired sequentially.
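The sketch below illustrates one plausible reading of the two pairing modes. It assumes that matrix expansion means every reactant of the first set is combined with every reactant of the second, and that sequential mode pairs the i-th reactants of both sets. The `pair_reactants` helper, the example SMILES, and the reaction template are hypothetical, and no chemistry is performed.

```python
from itertools import product

# Illustrative two-component reaction template in SMARTS form (not parsed here).
TEMPLATE = "[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]"


def pair_reactants(first_set, second_set, matrix_expansion):
    """Return the (reactant_1, reactant_2) pairs the reaction would be applied to."""
    if matrix_expansion:
        # Assumed meaning: all combinations across the two sets.
        return list(product(first_set, second_set))
    # Assumed meaning: row-by-row pairing; zip stops at the shorter set.
    return list(zip(first_set, second_set))


acids = ["CC(=O)O", "OC(=O)c1ccccc1"]
amines = ["NC", "NCC"]

all_pairs = pair_reactants(acids, amines, matrix_expansion=True)
row_pairs = pair_reactants(acids, amines, matrix_expansion=False)
print(len(all_pairs), len(row_pairs))  # 4 2
```

Under this reading, two sets of sizes m and n yield m × n products with matrix expansion and at most min(m, n) products without it.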
Similarity maps using fingerprints
Visualizes the atomic contributions to the similarity between a molecule and a reference molecule.
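The idea behind atomic contributions can be sketched without a cheminformatics toolkit: the weight of an atom is the change in similarity when the fingerprint bits that depend on that atom are removed. The toy below assumes each fingerprint feature is a `(bit, atoms)` pair, where `atoms` is the set of atom indices in the environment that sets the bit. This data layout and the function names are assumptions made for illustration; they are not the script's code.

```python
def tanimoto(a, b):
    """Tanimoto index of two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def atomic_weights(ref_bits, probe_features, n_atoms):
    """Similarity lost when each atom's features are removed from the probe."""
    full = {bit for bit, _ in probe_features}
    base = tanimoto(ref_bits, full)
    weights = []
    for atom in range(n_atoms):
        # Fingerprint rebuilt from the environments that do not contain this atom.
        without = {bit for bit, atoms in probe_features if atom not in atoms}
        weights.append(base - tanimoto(ref_bits, without))
    return weights


# Hypothetical reference fingerprint and probe molecule with four atoms.
ref_bits = {1, 2, 3}
probe_features = [(1, {0}), (2, {0, 1}), (3, {1, 2}), (9, {3})]

weights = atomic_weights(ref_bits, probe_features, n_atoms=4)
print(weights)  # [0.5, 0.5, 0.25, -0.25]
```

A positive weight means that removing the atom lowers the similarity, so the atom contributes to it. Atom 3 only sets a bit that is absent from the reference, so its weight is negative. A similarity map colors the atoms by these weights.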
References:
- RDKit generating similarity maps using fingerprints
- Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods
Solubility prediction
The model was trained with the H2O modeling engine on the "Solubility Train" dataset (#{x.Demo:SolubilityTrain."Solubility Train"}) using the "Generalized Linear Modeling" method.
Molecular descriptors used in the model:
- MolWt: Molecular weight
- Ipc: The information content of the coefficients of the characteristic polynomial of the adjacency matrix of a hydrogen-suppressed graph of a molecule
- TPSA: Topological polar surface area
- LabuteASA: Labute's approximate surface area
- NumHDonors: Number of hydrogen bond donors
- NumHAcceptors: Number of hydrogen bond acceptors
- MolLogP: Wildman-Crippen LogP value
- HeavyAtomCount: Number of heavy atoms
- NumRotatableBonds: Number of rotatable bonds
- RingCount: Number of rings
- NumValenceElectrons: Number of valence electrons
USRCAT
USRCAT is an extension of the Ultrafast Shape Recognition (USR) algorithm, which is used for molecular shape-based virtual screening to discover new chemical scaffolds in compound libraries. USRCAT incorporates pharmacophoric information in addition to molecular shape, which enables it to distinguish between compounds with similar shapes but distinct pharmacophoric features.
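The shape part of the method can be sketched in plain Python. The sketch computes a 12-number USR-style descriptor: three moments of the atomic distance distributions measured from four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one). As published, USRCAT repeats these moments for all atoms and for four pharmacophoric subsets (hydrophobic, aromatic, hydrogen bond acceptor, hydrogen bond donor), giving 60 numbers; atom typing is omitted here. The function names and coordinates are hypothetical, and the moment normalization (mean, standard deviation, cube root of the third central moment) is one common choice that may differ from what the script uses.

```python
import math


def _moments(values):
    """Mean, standard deviation, and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1 / 3), third)]


def usr_descriptor(coords):
    """12-number shape descriptor for a list of (x, y, z) atom coordinates."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_centroid = [math.dist(c, centroid) for c in coords]
    closest = coords[d_centroid.index(min(d_centroid))]
    farthest = coords[d_centroid.index(max(d_centroid))]
    d_farthest = [math.dist(c, farthest) for c in coords]
    farthest_from_farthest = coords[d_farthest.index(max(d_farthest))]

    descriptor = []
    for ref in (centroid, closest, farthest, farthest_from_farthest):
        descriptor.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptor


def usr_similarity(a, b):
    """Score in (0, 1]: inverse of one plus the mean absolute difference."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


# Hypothetical coordinates: a bent chain, a translated copy, and a straight chain.
bent = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0), (3.7, 1.4, 0.5), (4.1, 2.6, 0.2)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in bent]
straight = [(1.5 * i, 0.0, 0.0) for i in range(5)]

same_shape = usr_similarity(usr_descriptor(bent), usr_descriptor(shifted))
other_shape = usr_similarity(usr_descriptor(bent), usr_descriptor(straight))
```

Because the descriptor is built only from interatomic distances, `same_shape` is 1 up to rounding: no alignment of the two molecules is needed, which is what makes the approach fast. `other_shape` is lower because the distance distributions differ.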
References: