Datagrok's chemical intelligence, performance, and analytical sophistication let your organization bridge the gap between data systems and give scientists tools to make informed, data-driven decisions.
By combining cheminformatics expertise with the power and flexibility of the Datagrok platform, we created not just a molecule-aware spreadsheet but a universal, extensible system for addressing the needs of the modern research organization.
Cheminformatics is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used, for example, in pharmaceutical companies and academic settings in the process of drug discovery. These methods can also be used in chemical and allied industries in various other forms.
Datagrok provides first-class support for small molecules, as well as the most popular building blocks for cheminformatics. It understands several popular notations for representing chemical (sub)structures, such as SMILES and SMARTS. Molecules can be rendered in either 2D or 3D with different visualization options. They can be sketched as well. Chemical properties, descriptors, and fingerprints can be extracted. Predictive models that accept molecules as an input can be easily trained, assessed, executed, deployed, reused by other scientists, and used in pipelines or in info panels.
Several toxicity and drug-likeness prediction models are supported. Substructure and similarity search works out of the box for imported data, and can be efficiently used for querying databases through a Postgres chemical cartridge. To further explore collections of molecules, use advanced tools such as similarity and diversity search.
Simply import the dataset as you normally would - by opening a file, querying a database, connecting to a web service, or by any other method. The platform automatically recognizes chemical structures.
Sketch a molecule using the built-in editor, or retrieve one by entering a compound identifier. The following compound identifiers are natively understood since they have a prefix that uniquely identifies the source system: SMILES, InChI, InChIKey, CHEMBL, MCULE, comptox, and zinc. The rest of the 30+ identifier systems can be referenced by prefixing the identifier with the source name followed by a colon, e.g. 'pubchem:11122'.
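To make the 'source:identifier' convention concrete, here is a minimal sketch of how such input could be split into a source system and a value. It is not Datagrok's implementation: the function name `parse_identifier`, the subset of source names, and the regular expressions for self-describing identifiers are all assumptions chosen for illustration.

```python
import re

# Subset of the source names listed in this document; extend as needed.
KNOWN_SOURCES = {"pubchem", "chembl", "drugbank", "kegg_ligand", "chebi", "zinc"}

# Illustrative patterns (assumptions) for identifiers that carry their own prefix.
SELF_DESCRIBING = [
    ("inchi", re.compile(r"^InChI=")),
    ("inchikey", re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")),
    ("chembl", re.compile(r"^CHEMBL\d+$")),
    ("mcule", re.compile(r"^MCULE-\d+$")),
    ("comptox", re.compile(r"^DTXSID\d+$")),
    ("zinc", re.compile(r"^ZINC\d+$")),
]


def parse_identifier(text):
    """Return (source, value); source is None when nothing is recognized."""
    text = text.strip()
    # Explicit form: "<source>:<identifier>", accepted only for known sources
    # so that SMILES strings containing ':' are not misread.
    source, sep, rest = text.partition(":")
    if sep and rest and source.lower() in KNOWN_SOURCES:
        return source.lower(), rest
    # Implicit form: the identifier itself reveals its source system.
    for name, pattern in SELF_DESCRIBING:
        if pattern.match(text):
            return name, text
    return None, text


print(parse_identifier("pubchem:11122"))
print(parse_identifier("CHEMBL25"))
```

Checking the explicit prefix against a whitelist first keeps the two forms from colliding; anything unrecognized is returned unchanged so a caller can still try to treat it as a structure notation.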
Many viewers, such as grid, scatter plot, network diagram, tile viewer, bar chart, and trellis plot will recognize and render chemical structures.
Chemical intelligence tools are natively integrated into the platform, so in most cases the appropriate functionality is automatically presented based on the user's actions and context. For instance, when a user clicks on a molecule, it becomes the current object, and its properties are shown in the context panel. To see chemically-related actions applicable to a specific column, right-click the column and look under Current column | Chem and Current column | Extract. Alternatively, click on the column of interest and expand the 'Actions' section in the context panel. Check out Tools | Chemistry to see additional functionality. As always, it is a good idea to search for functionality using the smart search (Alt+Q), or by opening the registry of available functions under Help | Functions.
Use the 'Extract' popup menu to calculate the following properties: formula, drug likeness, acceptor count, donor count, logP, logD, polar surface area, rotatable bond count, stereo center count.
Chemical descriptors are numerical features extracted from chemical structures for molecular data mining, compound diversity analysis and compound activity prediction. In addition to properties, the platform also makes it easy to compute different sets of molecular descriptors. Supported descriptor sets are: Lipinski, Crippen, EState, EState VSA, Fragments, Graph, MolSurf, QED.
Fingerprints are a very abstract representation of certain structural features of a molecule. Similarity measures, calculations that quantify the similarity of two molecules, and screening, a way of rapidly eliminating molecules as candidates in a substructure search, are both processes that use fingerprints. Grok supports the following fingerprints: RDKFingerprint, MACCSKeys, AtomPair, TopologicalTorsion, Morgan/Circular.
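The screening idea can be sketched in a few lines. The example below assumes a fingerprint in which each bit marks the presence of a structural feature, so a molecule can contain the query only if every bit set in the query is also set in the molecule. The fingerprints are made-up integers used as bit sets, and the names `may_contain` and `screen` are hypothetical; real fingerprints would come from a cheminformatics toolkit.

```python
def may_contain(mol_fp: int, query_fp: int) -> bool:
    """False means the molecule certainly lacks the query substructure.

    True only means "cannot be ruled out": a full graph match is still needed.
    """
    return (mol_fp & query_fp) == query_fp


def screen(library: dict, query_fp: int) -> list:
    """Keep only the molecules that survive the fingerprint pre-filter."""
    return [name for name, fp in library.items() if may_contain(fp, query_fp)]


# Toy fingerprints (assumed values, 6 bits each).
library = {"mol_a": 0b101101, "mol_b": 0b001100, "mol_c": 0b111111}
query = 0b000101

candidates = screen(library, query)
print(candidates)  # ['mol_a', 'mol_c']
```

The cheap bitwise test discards `mol_b` without any atom-by-atom matching, which is why screening pays off on large collections: the expensive substructure match runs only on the survivors.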
We have implemented a few tools that help scientists analyze a collection of molecules in terms of molecular similarity. Both tools are based on applying different distance metrics (such as Tanimoto) to fingerprints.
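As an illustration of the Tanimoto metric mentioned above: for two fingerprints A and B viewed as sets of "on" bits, the Tanimoto similarity is |A ∩ B| / |A ∪ B|, and the corresponding distance is one minus that value. The sketch below uses plain Python sets with invented bit positions; treating two empty fingerprints as having similarity 0 is a convention chosen here, not something the source specifies.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    if union == 0:
        return 0.0  # both fingerprints empty: convention assumed for this sketch
    return shared / union


# Invented bit positions, for illustration only.
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8}

similarity = tanimoto(fp_1, fp_2)  # 2 shared bits out of 5 distinct bits
distance = 1.0 - similarity
print(similarity, distance)
```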
These tools can be used together as a collection browser. The 'Diverse structures' window shows different classes of compounds present in the dataset; when you click on a molecule representing a class, similar molecules are shown in the 'Similar structures' window.
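The browsing workflow can be approximated with a small sketch. The source does not say which algorithm selects the diverse representatives, so the greedy MaxMin selection below is an assumption, as are the toy fingerprints and the function names `pick_diverse` and `most_similar`.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_diverse(fps: dict, k: int) -> list:
    """Greedy MaxMin: repeatedly add the molecule farthest from those already picked."""
    names = list(fps)
    picked = [names[0]]  # arbitrary seed: the first molecule
    while len(picked) < min(k, len(names)):
        best = max(
            (n for n in names if n not in picked),
            key=lambda n: min(1 - tanimoto(fps[n], fps[p]) for p in picked),
        )
        picked.append(best)
    return picked


def most_similar(fps: dict, ref: str, top: int = 3) -> list:
    """Rank all other molecules by similarity to the reference molecule."""
    others = [n for n in fps if n != ref]
    return sorted(others, key=lambda n: tanimoto(fps[n], fps[ref]), reverse=True)[:top]


# Toy fingerprints forming three loose "classes" (assumed values).
fps = {
    "a1": {1, 2, 3, 4},
    "a2": {1, 2, 3, 5},
    "b1": {10, 11, 12},
    "b2": {10, 11, 13},
    "c1": {20, 21},
}

representatives = pick_diverse(fps, 3)        # one molecule per class
neighbours = most_similar(fps, "b1", top=1)   # what a click on 'b1' would show
print(representatives, neighbours)
```

Picking representatives first and ranking neighbours on demand mirrors the two windows: the first call summarizes the collection, the second drills into one class.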
In contrast to the physical predictive models, machine learning predictive models do not have any intrinsic knowledge about the physical and biological processes. Instead, they use techniques such as random forests or deep learning to discern mathematical relationships between empirical observations of small molecules and extrapolate them to predict chemical, biological and physical properties of novel compounds.
Datagrok enables machine learning predictive models by using chemical properties, descriptors, and fingerprints as features, and the observed properties as results when building predictive models. This lets scientists build predictive models that can be trained, assessed, executed, reused by other scientists, and used in pipelines.
See Cheminformatics predictive modeling for more details of building and applying a model.
Grok lets users easily and efficiently convert molecule identifiers between different source systems, including proprietary company identifiers.
Supported sources are: chembl, pdb, drugbank, pubchem_dotf, gtopdb, ibm, kegg_ligand, zinc, nih_ncc, emolecules, atlas, chebi, fdasrs, surechembl, pubchem_tpharma, pubchem, recon, molport, bindingdb, nikkaji, comptox, lipidmaps, carotenoiddb, metabolights, brenda, pharmgkb, hmdb, nmrshiftdb2, lincs, chemicalbook, selleck, mcule, actor, drugcentral, rhea
To map the whole column containing identifiers, use function.
The IUPAC name is located in the "Properties" panel.
To retrieve a single structure by an identifier, it might be handy to use the Sketcher.
Click on a molecule to select it as the current object. This brings up the molecule's properties in the context panel. The following panels are part of the 'chem' plugin:
In addition to these pre-defined info panels, users can develop their own using any scripting language supported by the Grok platform. For example, .
To search for molecules within the table that contain a specified substructure, click on the molecule column, and press Ctrl+F. To add a substructure filter to column filters, click on the '☰' icon on top of the filters, and select the molecular column under the 'Add column filter' submenu.
The maximum common substructure (MCS) problem is of great importance in multiple aspects of cheminformatics. It has diverse applications ranging from lead prediction to automated reaction mapping and visual alignment of similar compounds.
To find the MCS for a column with molecules, run the Chem | Find MCS command from the column's context menu. To execute it from the console, use the chem:findMCS(tableName, columnName) command.
R-Group Analysis is a common function in chemistry. Typically, it involves R-group decomposition, followed by the visual analysis of the obtained R-groups. Grok's chemically-aware Trellis Plot is a natural fit for such an analysis.
The scaffold concept is widely applied in medicinal chemistry. Scaffolds are mostly used to represent core structures of bioactive compounds. Although the scaffold concept has limitations and is often viewed differently from a chemical and computational perspective, it has provided a basis for systematic investigations of molecular cores and building blocks, going far beyond the consideration of individual compound series.
Applies a specified reaction to two columns containing molecules. The output table contains a row for each product produced by applying the reaction to the inputs. Each row contains the product molecule, index information, and the reactant molecules that were used.
'Do Matrix Expansion': If checked, each reactant 1 will be combined with each reactant 2 yielding the combinatorial expansion of the reactants. If not checked, reactants 1 and 2 will be combined sequentially, with the longer list determining the number of output rows.
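The difference between the two pairing modes can be sketched as follows. Only the pairing of reactants is shown, not the reaction itself. The function name `pair_reactants` is hypothetical, and padding the shorter list with `None` in sequential mode is an assumption: the description above states only that the longer list determines the number of output rows.

```python
from itertools import product, zip_longest


def pair_reactants(reactants1, reactants2, matrix_expansion):
    """Return the (reactant 1, reactant 2) pairs a reaction would be applied to."""
    if matrix_expansion:
        # Combinatorial expansion: every reactant 1 with every reactant 2.
        return list(product(reactants1, reactants2))
    # Sequential pairing: row i of list 1 with row i of list 2.
    # Assumption: missing partners in the shorter list are filled with None.
    return list(zip_longest(reactants1, reactants2))


amines = ["A1", "A2", "A3"]
acids = ["B1", "B2"]

expanded = pair_reactants(amines, acids, matrix_expansion=True)
sequential = pair_reactants(amines, acids, matrix_expansion=False)
print(len(expanded), len(sequential))  # 6 3
```

With three and two reactants, matrix expansion yields 3 × 2 = 6 pairs, while sequential pairing yields 3 rows, matching the length of the longer list.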
See details here.
The following cheminformatics-related functions are exposed:
Some routines are implemented as scripts:
Function | Molecules | Execution time, s |
---|---|---|
ChemSubstructureSearch | 1M | 70 |
ChemFindMcs | 100k | 43 |
ChemDescriptors (201 descriptors) | 1k | 81 |
ChemDescriptors (Lipinski) | 1M | 164 |
ChemGetRGroups | 1M | 233 |
ChemFingerprints (TopologicalTorsion) | 1M | 782 |
ChemFingerprints (MACCSKeys) | 1M | 770 |
ChemFingerprints (Morgan/Circular) | 1M | 737 |
ChemFingerprints (RDKFingerprint) | 1M | 2421 |
ChemFingerprints (AtomPair) | 1M | 1574 |
ChemSmilesToInChI | 1M | 946 |
ChemSmilesToInChIKey | 1M | 389 |
ChemSmilesToCanonical | 1M | 331 |
Efficient substructure and similarity searching in a database containing information about molecules is a key requirement for any chemical information management system. This is typically done by installing a so-called chemical cartridge on top of a database server. The cartridge extends the server's functionality with molecule-specific operations. These operations are made efficient by chemically-aware indexes, which are often based on molecular fingerprints, and are typically exposed as functions that can be used as part of an SQL query.
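As a rough sketch of what a cartridge-backed query can look like, the helper below builds a parameterized substructure search for the RDKit PostgreSQL cartridge, where `@>` is the substructure operator and `mol_from_smiles` converts a SMILES string into a molecule. Choosing the RDKit cartridge is an assumption made for illustration, as are the helper name `substructure_query` and the table and column names; this is not the SQL that Datagrok itself generates.

```python
import re

# Table and column names cannot be passed as query parameters,
# so restrict them to plain SQL identifiers.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def substructure_query(table: str, mol_column: str, smiles: str):
    """Build (sql, params) for a substructure search, RDKit-cartridge style.

    The SMILES value stays a bound parameter (%s, as used by psycopg-style drivers).
    """
    for name in (table, mol_column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT * FROM {table} WHERE {mol_column} @> mol_from_smiles(%s::cstring)"
    return sql, (smiles,)


# Hypothetical table 'compounds' with a molecule column 'mol'.
sql, params = substructure_query("compounds", "mol", "c1ccccc1")
print(sql)
print(params)
```

The resulting pair can be handed to a database driver's `execute(sql, params)` call; keeping the structure as a bound parameter avoids quoting problems with characters that are common in SMILES.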
Datagrok provides mechanisms for the automated translation of queries into SQL statements for several commonly used chemical cartridges. We support the following ones:
See DB Substructure and similarity search for details.