Cheminformatics
In addition to being a general-purpose extensible platform for scientific computing, Datagrok provides multiple options for developing cheminformatics solutions on top of it. Depending on your needs, use one or more of the following options, or come up with your own solution.
- For typical tasks such as calculating descriptors, substructure or similarity search, finding the maximum common substructure (MCS), or performing an R-group analysis, consider using the Datagrok JS API.
- For custom client-side computations, consider using either OpenChemLib.JS, or RDKit built for WebAssembly.
- For custom server-side computations, a popular option is using RDKit in Python. Python scripts can be seamlessly embedded into Datagrok via Scripting.
Datagrok JS API
We expose some commonly used functions via the JS API. To invoke the methods, use grok.chem.
Most of the methods are asynchronous. Behind the scenes, they use either a server-side assist or a separate Web Worker in the browser to call RDKit JS functions.
searchSubstructure(column, pattern, settings = { substructLibrary: true });
getSimilarities(column, molecule, settings = { /* reserved */ });
findSimilar(column, molecule, settings = { limit: ..., cutoff: ... });
diversitySearch(column, metric = METRIC_TANIMOTO, limit = 10);
rGroup(table, column, core);
mcs(column);
descriptors(table, column, descriptors);
searchSubstructure, getSimilarities, and findSimilar are client-based functions that use RDKit JS. We plan to support both server-side and browser-side modes. The API remains the same.
Substructure search
searchSubstructure(column, pattern = null, settings = { substructLibrary: true })
This function performs a substructure search using RDKit JS. It returns a BitSet. If the i-th element of the input column contains the pattern's substructure, the i-th bit is set to 1; otherwise, it is set to 0.
column is a column that contains molecules in any notation supported by RDKit JS: smiles, cxsmiles, molblock, v3Kmolblock, and inchi. The same applies to pattern: it is a string representation of a molecule in any of these notations.
The settings object allows passing the following parameters:
- molBlockFailover: a SMARTS string that is used as the substructure if pattern is invalid
To optimize the substructure search, we use a cache of pre-computed pattern fingerprints and RDKit Mol objects. Fingerprints are used for preliminary filtering, while Mol objects are used for the graph-based search method get_substruct_match. Fingerprints are calculated for each molecule in the column, whereas Mol objects are created only for a defined number of the first molecules. This reduces memory consumption and speeds up the search. When the substructure search function encounters a new column, it creates a cache of fingerprints and Mol objects for the molecules in it. Subsequent calls for the same column reuse this cache.
Molecule strings may be invalid or not supported by RDKit JS. We skip such inputs during indexing, and the corresponding bits in the bitset remain 0. Thus, the bitset always has the same length as the input column.
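The screen-then-verify flow described above can be illustrated with a minimal, self-contained sketch. It is not the Datagrok or RDKit JS implementation: molecules are treated as plain strings, pairFingerprint is a hypothetical toy fingerprint built from hashed character pairs, and String.includes stands in for the graph-based match. All function names below are assumptions made for illustration.

```javascript
// Toy stand-ins, NOT chemistry: a "molecule" is any non-empty string, the
// "fingerprint" is a 32-bit mask of hashed character pairs, and the exact
// matcher is String.includes (standing in for a graph-based match).
function pairFingerprint(s) {
  let bits = 0;
  for (let i = 0; i + 1 < s.length; i++) {
    const h = (s.charCodeAt(i) * 31 + s.charCodeAt(i + 1)) % 32;
    bits |= 1 << h;
  }
  return bits;
}

// One cache entry per column object, built the first time a column is seen.
const substructureCache = new WeakMap();

function getColumnFingerprints(column) {
  let fps = substructureCache.get(column);
  if (!fps) {
    // Invalid entries (null, non-string, empty) get no fingerprint.
    fps = column.map((m) =>
      (typeof m === 'string' && m.length > 0 ? pairFingerprint(m) : null));
    substructureCache.set(column, fps);
  }
  return fps;
}

// Returns one 0/1 flag per row; the result always has column.length entries.
function searchSubstructureSketch(column, pattern) {
  const fps = getColumnFingerprints(column);
  const patternFp = pairFingerprint(pattern);
  const hits = new Uint8Array(column.length); // every flag starts at 0
  for (let i = 0; i < column.length; i++) {
    if (fps[i] === null) continue;                    // skipped input, flag stays 0
    if ((fps[i] & patternFp) !== patternFp) continue; // cheap screen: cannot match
    if (column[i].includes(pattern)) hits[i] = 1;     // exact check on survivors
  }
  return hits;
}
```

The screen never rejects a true match in this toy setting: every character pair of a substring also occurs in the containing string, so all pattern bits are set in the fingerprint of any row that matches.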
Similarity scoring
Similarity scoring functions use Morgan fingerprints and Tanimoto similarity (also known as Jaccard similarity) to rank molecules from a column by their similarity to a given molecule.
Following the same convention as searchSubstructure, column is a column of type String containing molecules, each in any notation RDKit JS supports: smiles, cxsmiles, molblock, v3Kmolblock, and inchi. The same applies to molecule: it is a string representation of a molecule in any of these notations.
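For reference, the Tanimoto similarity of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it over fingerprints stored as Uint32Array words. The function names are hypothetical, the inputs are hand-written bit patterns rather than real Morgan fingerprints, and returning 0 for two empty fingerprints is a convention chosen here, not something the source specifies.

```javascript
// Count the set bits of a 32-bit integer.
function popcount32(x) {
  let count = 0;
  while (x !== 0) {
    x &= x - 1; // clear the lowest set bit
    count++;
  }
  return count;
}

// Tanimoto (Jaccard) similarity of two equal-length bit fingerprints.
function tanimoto(a, b) {
  if (a.length !== b.length)
    throw new Error('Fingerprints must have the same length');
  let common = 0; // bits set in both fingerprints
  let total = 0;  // bits set in at least one fingerprint
  for (let i = 0; i < a.length; i++) {
    common += popcount32(a[i] & b[i]);
    total += popcount32(a[i] | b[i]);
  }
  // Assumption of this sketch: two empty fingerprints score 0.
  return total === 0 ? 0 : common / total;
}

// One shared bit out of three distinct bits.
const exampleScore = tanimoto(Uint32Array.of(0b1100), Uint32Array.of(0b0110));
```

Identical non-empty fingerprints score 1.0 and fingerprints with no bits in common score 0.0, which matches the 0.0 to 1.0 range of the scores described in this section.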
Find most similar molecules sorted by similarity
findSimilar(column, molecule, settings = { limit: ..., cutoff: ... })
The default settings are { limit: Number.MAX_VALUE, cutoff: 0.0 }, so by default the function ranks and sorts all molecules by similarity.
Produces a Datagrok DataFrame of three columns (code sample):
- The molecule column contains the original molecules from the input column.
- The score column contains the corresponding similarity scores in the range from 0.0 to 1.0. The DataFrame is sorted in descending order by this column.
- The index column contains the indices of the molecules in the original input column.
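The effect of limit and cutoff on the result can be sketched in plain JavaScript, independent of Datagrok. The helper below is hypothetical: it takes precomputed scores and returns rows with the same three fields as the columns described above. Treating cutoff as inclusive and dropping rows without a score are assumptions of this sketch, not documented behavior.

```javascript
// Rank precomputed similarity scores the way the text describes:
// filter by cutoff, sort descending by score, keep at most `limit` rows.
function findSimilarSketch(molecules, scores, settings = {}) {
  const { limit = Number.MAX_VALUE, cutoff = 0.0 } = settings;
  const rows = [];
  for (let i = 0; i < molecules.length; i++) {
    const score = scores[i];
    // Assumptions: rows without a score are dropped; cutoff is inclusive.
    if (score === null || score === undefined || score < cutoff) continue;
    rows.push({ molecule: molecules[i], score, index: i });
  }
  rows.sort((a, b) => b.score - a.score); // descending by score
  return rows.slice(0, Math.min(limit, rows.length));
}

// Scores here are made up for illustration.
const ranked = findSimilarSketch(
  ['CCO', 'CCN', 'c1ccccc1', 'CCC'],
  [0.2, 0.9, null, 0.5],
  { limit: 2, cutoff: 0.3 });
```

The index field keeps the position in the original input, so each ranked row can be traced back to its source row after sorting.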
Scoring each molecule against a given molecule
getSimilarities(column, molecule = null, settings = { sorted: false });
Produces a Datagrok DataFrame with a single Column, where the i-th element contains the similarity score for the i-th element of the input Column (code sample).
Similarly to the searchSubstructure function, these functions maintain a cache of pre-computed fingerprints. If findSimilar or getSimilarities encounters a new column, it creates a cache of fingerprints for the molecules in it and then uses this cache to compute similarity scores for the same column.
To only build (aka prime) this cache without performing an actual search, e.g. to pre-compute the fingerprints in the UI, call the function passing molecule as an empty string ''.
If a molecule is not supported by RDKit JS, we skip it during indexing, and null is returned instead of a fingerprint. Thus, the output Column (or DataFrame) always has the same length as the input column.
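The caching and priming behavior described above can be modeled with a small self-contained sketch. It does not use the Datagrok API: charMaskFingerprint is a hypothetical toy fingerprint, the cache is a WeakMap keyed by the column array, and both the null score for unsupported rows and the null return value of a priming call are assumptions of this sketch.

```javascript
// Hypothetical stand-in for a real fingerprint routine: a 32-bit mask of
// the characters present, or null for input it cannot handle.
function charMaskFingerprint(molecule) {
  if (typeof molecule !== 'string' || molecule.length === 0) return null;
  let bits = 0;
  for (let i = 0; i < molecule.length; i++)
    bits |= 1 << (molecule.charCodeAt(i) % 32);
  return bits;
}

function countBits(x) {
  let n = 0;
  while (x !== 0) { x &= x - 1; n++; }
  return n;
}

const similarityCache = new WeakMap();
let indexBuilds = 0; // how many times a column was actually indexed

function getSimilaritiesSketch(column, molecule) {
  let fps = similarityCache.get(column);
  if (!fps) {
    fps = column.map((m) => charMaskFingerprint(m)); // null for unsupported rows
    similarityCache.set(column, fps);
    indexBuilds++;
  }
  if (molecule === '') return null; // priming call: build the cache only
  const query = charMaskFingerprint(molecule);
  // One entry per input row, so the output length equals the column length.
  return fps.map((fp) => {
    if (fp === null || query === null) return null;
    const union = countBits(fp | query);
    return union === 0 ? 0 : countBits(fp & query) / union;
  });
}
```

A typical sequence is one priming call when the column first appears, followed by any number of scoring calls that reuse the cached fingerprints.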
Examples (see on GitHub)
- Calculating descriptors
- Substructure search
- Diversity search
- Maximum common substructure
- R-Group analysis
Molecule rendering
SVG rendering with OpenChemLib
To render a molecule into a div as an SVG, use grok.chem.svgMol. This method uses the OpenChemLib library.
ui.dialog()
.add(grok.chem.svgMol('O=C1CN=C(C2CCCCC2)C2:C:C:C:C:C:2N1', 300, 200))
.show();
Rendering to canvas with RDKit and scaffolds
To render a molecule, aligned to bonds, use grok.chem.canvasMol. This method uses the RDKit library.
To highlight a scaffold, specify it as SMARTS.
You can specify the molecule string in any format supported by RDKit. Here is a complete example:
let root = ui.div();
let canvas = document.createElement('canvas');
canvas.width = 300; canvas.height = 200;
root.appendChild(canvas);
await grok.chem.canvasMol(
0, 0, canvas.width, canvas.height, canvas,
'COc1ccc2cc(ccc2c1)C(C)C(=O)OCCCc3cccnc3',
'c1ccccc1');
ui
.dialog({title:'Molecule'})
.add(root)
.show();
The method is currently asynchronous due to technical limitations. It will be made synchronous in the future.
Run the above examples live here: public.datagrok.ai.
OpenChemLib.JS
OpenChemLib.JS is a JavaScript port of the OpenChemLib Java library. Datagrok currently uses it for some of the cheminformatics-related routines that we run in the browser, such as rendering of the molecules.
Here is an example of manipulating atoms in a molecule using OpenChemLib.JS.
RDKit in Python
RDKit in Python is a set of Python wrappers for RDKit, one of the best open-source toolkits for cheminformatics. While Python scripts are executed on a server, they can be seamlessly embedded into Datagrok via Scripting.
Here are some cheminformatics scripts based on RDKit in Python in the public repository.
RDKit in WebAssembly
Recently, Greg Landrum, the author of RDKit, introduced a way to compile its C++ code to WebAssembly, which makes it possible to combine the performance and efficiency of the carefully crafted C++ codebase with the ability to run it in the browser. This approach fits perfectly with Datagrok's philosophy of performing as many computations on the client as possible, so naturally we've jumped on that opportunity!
See also: