Cheminformatics

Requirements

To access the cheminformatics functionality, install these packages using the Package Manager (on the Sidebar, click Manage > Packages):

  • Required. Chem.
  • Optional. Sketchers: The Chem package includes a built-in OpenChemLib Sketcher, but you can use your favorite sketcher, such as Ketcher (Apache License, Version 2.0), MarvinJS (commercial license), or ChemDraw (commercial license).
  • Optional. Chembl: Integration with the ChEMBL database deployed on your premises.
  • Optional. DrugBank: Integration with DrugBank (information on 11,300 drugs is included with the plugin).
  • Optional. Integration with external webservices (these packages transmit your data to external services):
    • ChemblAPI
    • PubChem
    • Enamine: Integration with Enamine, a service for online shopping for chemical building blocks.
    • Chemspace: Integration with Chemspace, a service for online shopping for chemical building blocks.

Datagrok provides an intuitive interface and a wide range of tools for working with chemical data.

Data access

Datagrok provides a single, unified access point for organizations. You can connect to any of the 30+ supported data sources, retrieve data, and securely share data with others.

Chemical queries against data sources require a chemical cartridge, such as RDKit Postgres cartridge or JChem cartridge. These cartridges allow molecule-specific operations (like substructure or similarity searches) to be integrated into SQL queries using SQL syntax.

Example: Substructure search in a database

To create a chemically-aware query, use the SQL syntax specific to your cartridge. Annotate parameters like you would for a function. Here is an example of querying ChEMBL on Postgres with the RDKit cartridge installed:

Substructure search:

--name: @pattern substructure search
--connection: chembl
--input: string pattern {semType: Substructure}
--input: int maxRows = 1000
select molregno,m as smiles from rdk.mols where m@>@pattern
limit @maxRows

Similarity search:

--name: @pattern similarity search
--connection: chembl
--input: string pattern {semType: Substructure}
--input: int maxRows = 1000
select fps.molregno, cs.canonical_smiles as smiles
from rdk.fps fps
join compound_structures cs on cs.molregno = fps.molregno
where mfp2%morganbv_fp(@pattern)
limit @maxRows

To run a query, sketch the substructure and click OK. Datagrok retrieves the data and opens it in a spreadsheet.

DB Substructure and Similarity Search

To learn more about querying data and data access in general, see the Access section of our documentation.

Exploring chemical data

When you open a dataset, Datagrok automatically detects molecules and makes available molecule-specific context actions. For example, when you open a CSV file containing molecules in the SMILES format, the following happens:

  • Data is parsed, and the semantic type molecule is assigned to the corresponding column.
  • Molecules are automatically rendered in the spreadsheet.
  • Column tooltip now shows the most diverse molecules in your dataset.
  • Default column filter is now a sketcher-driven substructure search.
  • A top menu item labeled Chem appears.
  • When you click a molecule, the Context Panel on the right shows molecule-specific info panes, such as Toxicity or Drug Likeness.

Chem dataset exploration

The following info panes are shown by default for the current molecule:

The following info panes are shown by default for the current molecular column:

  • Most Diverse Structures
  • Rendering (offers rendering options for molecules):
    • Render a molecule or show its textual representation (Show structures).
    • Aligned to a particular scaffold (Scaffold).
    • Aligned to a scaffold defined in the specified column (Scaffold column).
    • Highlight the scaffold (Highlight from column).
    • Force regeneration for atom coordinates, even if the molecule is defined as a MOLBLOCK (Regenerate coords).
    • Type of the filter for that column, sketcher or categorical (Filter type).

Info pane options

Some info panes can be customized. To reveal an info pane's available options, hover over it:

  • View and/or edit the underlying script (click the Script icon).
  • Change parameters (click the Parameter icon).
  • Change the info pane's settings (click the Gear icon).
  • Append info pane as a column (click the More actions icon and select Add as a column).

Customize-info-panes

To learn how to customize and extend the platform programmatically, see the Develop section of our documentation.

developers

Info panes can be extended with functions in any supported language.

Chemically aware spreadsheet

The spreadsheet lets you visualize, edit, and efficiently work with chemical structures. Additionally, you can add new columns with calculated values or visualizations from info panes or context actions. The features also include the ability to interactively filter rows, color-code columns, pin rows or columns, set edit permissions, and more. To learn how to work with the spreadsheet, see this article or complete this tutorial.

Chemical spreadsheet

Chemically aware viewers

Datagrok viewers recognize and display chemical data. The viewers were built from scratch to take advantage of Datagrok's in-memory database, enabling seamless access to the same data across all viewers. They also share a consistent design and usage patterns. Any action taken on one viewer, such as hovering, selecting, or filtering, is automatically applied to all other viewers, creating an interconnected system ideal for exploratory data analysis.

In addition to the chemical spreadsheet, viewers include a scatterplot, a network diagram, a tile viewer, a bar chart, a form viewer, a trellis plot, and others. All viewers can be saved as part of a layout or a dashboard. Some viewers offer built-in statistics.

To learn how to use viewers to explore chemical data, complete this tutorial or visit the Visualize section of our documentation.

Sketching

note

Datagrok provides integrations with various sketchers, but their availability depends on their licensing options. Some sketchers require a commercial license from the vendor before they can be used in Datagrok. To use these sketchers, you must provide a path to the license server.

How to provide a license path
  1. Go to Manage > Packages and select a package that you want to use.
  2. Navigate to the Context Panel > Settings and enter the license code in the License path field.
  3. Click SAVE to activate the license.

To launch a sketcher, double-click a molecule. Alternatively, in the Actions info pane, select Sketch.

You can sketch a molecule or retrieve one by entering SMILES, compound identifier, or a common name (like aspirin). The following compound identifiers are natively understood since they have a prefix that uniquely identifies the data source: SMILES, InChI, InChIKey, ChEMBL, MCULE, comptox, and zinc (example: CHEMBL358225). The rest of the 30+ identifier systems can be referenced by prefixing the source name followed by a colon to the identifier (example: 'pubchem:11122').

Supported identifier systems

actor           drugbank      lipidmaps      pubchem
atlas           drugcentral   mcule          pubchem_dotf
bindingdb       emolecules    metabolights   pubchem_tpharma
brenda          fdasrs        molport        recon
carotenoiddb    gtopdb        nih_ncc        rhea
chebi           hmdb          nikkaji        selleck
chembl          ibm           nmrshiftdb2    surechembl
chemicalbook    kegg_ligand   pdb            zinc
comptox         lincs         pharmgkb

developers

You can register a function that instructs Datagrok to retrieve molecules from a database using custom identifiers. To do so, annotate the function with the following parameters: --meta.role: converter and --meta.inputRegexp:. These annotations let Datagrok recognize the function as a converter and route matching input to it.

Example

The following code snippet defines a converter function that retrieves the SMILES representation of a molecule from the ChEMBL database using the provided ChEMBL identifier. The output is the canonical SMILES string representing the molecule.

--name: chemblIdToSmiles
--friendlyName: Converters | ChEMBL to SMILES
--meta.role: converter
--meta.inputRegexp: (CHEMBL[0-9]+)
--connection: Chembl
--input: string id = "CHEMBL1185"
--output: string smiles { semType: Molecule }
--tags: unit-test
select canonical_smiles from compound_structures s
join molecule_dictionary d on s.molregno = d.molregno
where d.chembl_id = @id
--end
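
The identifier rules described above (self-identifying prefixes such as CHEMBL, and the source:identifier form for everything else) can be illustrated with a small Python sketch. It is not Datagrok's implementation: classify_identifier, KNOWN_SOURCES, and NATIVE_PATTERNS are hypothetical names, the source list is a subset of the supported identifier systems, and only two native patterns are included (the ChEMBL regexp from the converter above and the standard "InChI=" prefix).

```python
import re

# Hypothetical subset of the supported identifier systems listed in this section.
KNOWN_SOURCES = {"pubchem", "drugbank", "chebi", "zinc", "mcule", "chembl"}

# Identifiers that carry their own prefix. The ChEMBL pattern mirrors the
# --meta.inputRegexp of the converter above; "InChI=" is the standard InChI prefix.
NATIVE_PATTERNS = [
    ("chembl", re.compile(r"^CHEMBL[0-9]+$")),
    ("inchi", re.compile(r"^InChI=.+$")),
]


def classify_identifier(text):
    """Return (source, identifier), or (None, text) if no source is recognized.

    Hypothetical helper for illustration only.
    """
    text = text.strip()
    # 1. Self-identifying prefixes are checked first.
    for source, pattern in NATIVE_PATTERNS:
        if pattern.match(text):
            return source, text
    # 2. Otherwise look for an explicit "source:identifier" form.
    source, sep, rest = text.partition(":")
    if sep and rest and source.lower() in KNOWN_SOURCES:
        return source.lower(), rest
    # 3. Anything else (for example, a SMILES string) is left unclassified.
    return None, text
```

Checking the native patterns before splitting on the colon matters because SMILES strings can themselves contain a colon, so the split is only trusted when the part before it is a known source name.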

From the Hamburger (☰) menu, you can access the following features:

  • Copy a molecule as SMILES or MOLBLOCK.
  • View recently sketched structures.
  • View/add to favorites.
  • Switch between available sketchers. Datagrok remembers your preferred sketcher type, so you don't have to select it every time you use it.

Sketcher features

Sketchers are synchronized with the Context Panel. As you draw or edit a molecule, the info panes on the right dynamically update with the information about the sketched compound.

Substructure search / Filtering

Datagrok offers an intuitive filtering functionality to explore and filter datasets. Hovering over categories or distributions in the Filter Panel instantly highlights relevant data points across all viewers. For molecules, Datagrok uses integrated sketchers to allow structure-based filtering. After applying the filter, Datagrok highlights the queried substructures in the filtered subset.

Filter by substructure

How to use

To filter by substructure, follow these steps:

  1. On the Menu Ribbon, click the Filter icon to open the Filter Panel. The panel shows filters for all dataset columns. By default, the substructure filter is displayed on top, but you can rearrange, add, or remove filter columns by using available controls.
  2. Open the sketcher by clicking the Click to edit button. Sketch or enter a substructure.
  3. Once finished, click OK to apply the filter.

To clear the filter, use the checkbox provided. To remove the filter altogether, use the Close (x) icon.

To learn more about filtering, watch this video or read this article.

Datagrok offers two analytical tools to help you analyze a collection of molecules based on molecular similarity: similarity search and diversity search. Similarity search finds structures similar to the reference molecule, while diversity search shows N molecules of different chemical classes present in the dataset. Both tools are based on fingerprints and use a customizable distance metric.

Available distance metrics
  • Tanimoto
  • Dice
  • Cosine
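
To make the three metrics concrete, here is a minimal Python sketch that computes them on binary fingerprints represented as sets of "on" bit positions. The formulas are the standard textbook definitions of the similarity coefficients (the corresponding distance is usually taken as 1 minus the similarity); the function names and sample fingerprints are illustrative assumptions, not Datagrok code.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Twice the shared bits divided by the total number of set bits."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the set-bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Two toy fingerprints (made-up bit positions).
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
```

For the same pair of fingerprints, Tanimoto is never larger than Dice or Cosine, so a similarity threshold tuned for one metric generally needs adjusting when you switch to another.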

To sort a dataset by similarity, right-click your reference molecule and select Current value > Sort by similarity. The dataset is sorted with the reference molecule pinned at the top.

To explore the dataset further, use the similarity and diversity viewers (Top Menu > Chem > Search > Similarity Search... or Diversity Search...). The viewers are interactive and let you quickly switch between the molecules of interest.
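
The idea behind the two tools can be sketched in a few lines of Python. Ranking by similarity is straightforward; for diversity, the sketch uses greedy MaxMin picking, a common technique chosen here for illustration (this document does not state which algorithm Datagrok uses). Fingerprints are modeled as sets of bit positions, and all names and data are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(reference, fingerprints):
    """Return row indices ordered from most to least similar to the reference."""
    return sorted(
        range(len(fingerprints)),
        key=lambda i: tanimoto(reference, fingerprints[i]),
        reverse=True,
    )


def pick_diverse(fingerprints, n):
    """Greedy MaxMin: start from row 0, then repeatedly add the row whose
    nearest already-picked neighbour is farthest away."""
    if not fingerprints or n <= 0:
        return []
    picked = [0]
    while len(picked) < min(n, len(fingerprints)):
        best, best_dist = None, -1.0
        for i in range(len(fingerprints)):
            if i in picked:
                continue
            # Distance to the closest molecule that is already picked.
            dist = min(1.0 - tanimoto(fingerprints[i], fingerprints[j]) for j in picked)
            if dist > best_dist:
                best, best_dist = i, dist
        picked.append(best)
    return picked


# Toy dataset: rows 0, 1, and 3 are close to each other; row 2 is unrelated.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
```

With this toy data, ranking against row 0 puts row 3 right after the reference itself, while diversity picking selects the unrelated row 2 first.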

Similarity and diversity search
How to use

To configure a similarity or diversity viewer, click the Gear icon at the viewer's top. The Context Panel updates with available controls. You can change parameters like the similarity threshold, fingerprint type, or the distance measure.

By default, a reference molecule follows the current row. If you click a different molecule, the similarity viewer updates accordingly. To lock in a specific reference molecule, clear the Follow Current Row control. To sketch a custom reference molecule, click the Edit icon on the reference molecule card.

You can enhance the viewer cards by incorporating column data. To do so, use the Molecule Properties control. If a column is color-coded, its format is reflected on the card's value. To adjust how the color is shown (either as a background or text), use the Apply Color To control. To remove highlighting, clear the color-coding from the corresponding column in the dataset.

similarity_search_add_fields

Chemical space

Datagrok lets you analyze chemical space using distance-based dimensionality reduction algorithms, such as tSNE and UMAP. These algorithms use fingerprints to convert cross-similarities into 2D or 3D coordinates. This lets you visualize the similarities between molecular structures and identify clusters of similar molecules, outliers, or patterns that might be difficult to detect otherwise. The results are visualized on an interactive scatterplot.

chem-space

How to use
  1. Go to the Menu Ribbon and choose Chem > Analyze Structure > Chemical Space... This opens a Chemical Space parameter dialog.
  2. In the dialog, select the source of the molecule and set the parameters such as the distance metric (Tanimoto, Asymmetric, Cosine, Sokal) and the algorithm you want to use. To change the default settings for the selected algorithm, click the Gear icon next to the Method name control.
  3. Click OK. A scatterplot is added to the view.

Structure analysis

R-groups analysis

R-group analysis decomposes a set of molecules into a core and R-groups (ligands at certain attachment positions), and visualizes the results. The query molecule consists of the scaffold and ligand attachment points represented by R-groups.

R-Group Analysis
How to use
  1. Go to Chem > Analyze SAR > R-Groups Analysis... A sketcher opens.
  2. In the sketcher, specify the common core (scaffold) for the selected molecular column using one of these methods:
    • Draw or paste a scaffold in the sketcher.
    • Click MCS to automatically identify the maximum common substructure.
  3. Click OK to execute. The R-group columns are added to the dataframe, along with a trellis plot for visual exploration.

The trellis plot initially displays pie charts. To change the chart type, use the Viewer control in the top-left corner to select a different viewer.

If you prefer not to use a trellis plot, close it or clear the Visual analysis checkbox during Step 2. You can manually add it later. You can also use other chemical viewers, like scatterplot, box plot, bar chart, and others.

developers

To run the R-group decomposition programmatically, see this sample script.

Scaffold tree analysis

Scaffold tree organizes molecular datasets by arranging molecules into a tree hierarchy based on their scaffolds. This hierarchy can then be used for filtering or selecting the corresponding rows in the dataset.

A hierarchy can be either generated automatically, or sketched manually. To access, in the Menu Ribbon select Chem > Analyze > Scaffold Tree. When Scaffold Tree is initially created, a tree is generated automatically. You can also sketch the tree manually, or modify the automatically generated one. For tree generation, we use a derivative of the open-source ScaffoldGraph library developed by Oliver Scott.

note

Scaffold tree generation is computationally intensive and may take a significant amount of time.

scaffold-tree-generate-edit

A scaffold tree consists of scaffolds and orphan nodes. Each scaffold should contain its parent scaffold as a substructure. Orphans are molecules that contain the parent scaffold but do not contain any of the sibling scaffolds. A picture best illustrates this:

Scaffold tree anatomy
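
The orphan rule can also be expressed in code. The Python sketch below assumes that substructure matching has already been done elsewhere and that, for each scaffold, you have the set of dataset rows containing it; the scaffold names, row numbers, and the orphans helper are hypothetical.

```python
def orphans(parent, children, matches):
    """Rows that contain the parent scaffold but none of its child scaffolds."""
    covered = set()
    for child in children:
        covered |= matches[child]
    return matches[parent] - covered


# Precomputed substructure matches: scaffold -> rows that contain it (toy data).
matches = {
    "benzene": {0, 1, 2, 3, 4},
    "naphthalene": {1, 2},
    "indole": {3},
}

# One level of the tree: parent scaffold -> its child scaffolds.
tree = {"benzene": ["naphthalene", "indole"]}

benzene_orphans = orphans("benzene", tree["benzene"], matches)
```

Here rows 0 and 4 contain the benzene scaffold but neither of its children, so they end up in the orphan node under benzene.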

Sketch or modify a scaffold tree
To manually sketch or modify the scaffold tree, use these controls:
  • To clear the tree, in the Toolbar, click the Drop all trees (trash) icon.
  • To add a new root node, in the Toolbar, click the Add new root structure (+) icon. This opens a molecular sketcher.
  • To add a new scaffold under an existing one, click the Add new scaffold (+) icon next to the scaffold. Alternatively, right-click the molecule and select Add New....
  • To delete a scaffold along with its children, right-click it and select Remove.
  • To edit a scaffold, click the Edit... icon next to the scaffold. Alternatively, right-click the molecule, and select Edit.... This opens a molecular sketcher.

In addition, you can download a file with the scaffold tree to your local drive. To do so, hover over any scaffold tile and click the Download icon in the Toolbar. To load a previously saved tree, hover over any scaffold tile and click the Open (+) icon in the Toolbar, then select the saved file.

scaffold-tree-controls

Scaffold tree is a viewer, which means it's synchronized with other viewers, can be shared or saved as part of a layout or a dashboard, or used to filter a dataset.

Filtering and selection

To highlight rows matching a particular scaffold, hover the mouse over the scaffold tile. To select rows matching a particular scaffold, click the Select rows icon next to the scaffold. To deselect rows matching a particular scaffold, click the Deselect rows icon next to the scaffold. The state is picked up by other viewers.

scaffold-tree-filter-select

To filter a dataset using a scaffold tree, do the following:

  • To filter by a particular scaffold exclusively, select the checkbox to the left of the molecule. This action automatically clears any other checkboxes and helps to navigate quickly within the dataset.
  • To add another scaffold to the filtered subset, select the corresponding checkbox. Use the AND/OR control in the Toolbar to set the desired logical operation. If needed, you can invert the function of an individual checkbox to exclude the scaffold from the subset instead of adding it. To invert the checkbox mode, click the Doesn't equal icon located in the top left corner of the scaffold tile. The change in the state of an individual checkbox doesn't affect the state of other checkboxes.
  • To clear all filters, click the Filter icon in the Toolbar.
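
The checkbox logic above amounts to set operations. The following Python sketch shows one plausible reading of it, assuming precomputed substructure matches per scaffold; scaffold_filter and its arguments are hypothetical names, and the handling of inverted checkboxes ("does not contain") is an assumption about the semantics rather than a description of Datagrok internals.

```python
def scaffold_filter(all_rows, checked, matches, mode="OR"):
    """Combine checked scaffolds into one filtered row set.

    checked: list of (scaffold, inverted) pairs; inverted=True means
             "rows that do NOT contain this scaffold".
    mode:    "AND" intersects the per-scaffold sets, "OR" unites them.
    """
    row_sets = []
    for scaffold, inverted in checked:
        rows = matches[scaffold]
        row_sets.append(all_rows - rows if inverted else set(rows))
    if not row_sets:
        # No checkbox selected: nothing is filtered out.
        return set(all_rows)
    result = row_sets[0]
    for rows in row_sets[1:]:
        result = result & rows if mode == "AND" else result | rows
    return result


all_rows = {0, 1, 2, 3, 4, 5}
matches = {"A": {0, 1, 2}, "B": {2, 3}}
```

For example, checking A and B in OR mode keeps rows 0 to 3, AND mode keeps only row 2, and AND mode with B inverted keeps rows 0 and 1.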

scaffold-tree-controls

You can add the scaffold tree as a filter to the Filter Panel:

  1. In the Menu Ribbon, click the Filter icon to toggle the Filter panel.
  2. In the top left corner, click the Hamburger (☰) icon > Add External > Scaffold Tree Filter. A dialog opens.
  3. In the dialog, select the molecular column and click OK. A scaffold tree tile is added to the Filter Panel.
  4. To filter, click the Add (+) icon, then paste or draw a scaffold using a sketcher.

Scaffold tree filter panel

You can use a scaffold tree with other filters, where each filter eliminates rows that do not meet the filtering criteria.

Elemental analysis

Elemental Analysis analyzes the elemental composition of a molecular structure and visualizes the results in a radar viewer. Each point on the chart represents an element, and the distance from the center of the chart to the point indicates the relative abundance of that element in the structure. Use it as a basic tool to explore the dataset and detect rare elements and unique properties.

How to use
  1. In the Menu Ribbon, open the Chem menu and select Analyze structure > Elemental Analysis... A parameter input dialog opens.
  2. In the dialog:
    1. Select the source table and the molecular column that you want to analyze.
    2. Select the desired visualization option. You can choose between a standalone viewer (select Radar View) and sparklines (select Radar Grid), both of which use a radar viewer.
  3. Click OK to execute the analysis. New columns with atom counts and molecule charges are added to the spreadsheet and plotted on a radar chart using the selected visualization option.
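
To illustrate what the radar chart encodes, the Python sketch below counts atoms per element and converts the counts into relative abundances. It is a simplification: it parses a flat molecular formula string (no parentheses, charges, or isotopes) rather than a chemical structure, and the function names are hypothetical.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Count atoms per element in a flat formula such as 'C9H8O4'."""
    counts = Counter()
    for symbol, number in TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def relative_abundance(counts):
    """Fraction of all atoms contributed by each element (the radar radius)."""
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()} if total else {}


aspirin = element_counts("C9H8O4")
```

In a dataset dominated by C, H, N, and O, a molecule with a non-zero value on an otherwise empty axis (for example, B or Se) stands out immediately, which is how the chart helps detect rare elements.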

Structure relationship analysis

Activity cliffs

The Activity Cliffs tool in Datagrok detects and visualizes pairs of molecules with highly similar structures but significantly different activity levels, known as "activity cliffs." This tool uses distance-based dimensionality reduction algorithms (such as tSNE and UMAP) to convert cross-similarities into a 2D scatterplot.

Activity cliffs

How to use
  1. Open the Chem menu and select Analyze SAR > Activity Cliffs.
  2. In the parameter input dialog, specify the following:
    1. Select the source table, molecular column, and activity data column to analyze.
    2. Set the similarity cutoff.
    3. Select a dimensionality reduction algorithm and adjust its parameters using the Gear icon next to the Method control.
  3. Click OK to execute the analysis. A scatterplot visualization is added to the view.
  4. Optional. In the scatterplot, click the link with the detected number of cliffs to open an Activity Cliffs table containing all pairs of molecules identified as cliffs. The table also includes detailed information such as the similarity score, the activity difference score, and other data.

In the scatterplot, the marker color corresponds to the level of molecule activity, and the size represents the maximum detected activity cliff for that molecule. The opacity of the line connecting molecule pairs corresponds to the size of the activity cliff.
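
The detection step can be sketched in Python as follows. The sketch keeps every pair whose similarity reaches the cutoff and whose activity difference exceeds a minimum, then records the largest cliff per molecule (the quantity mapped to marker size). The min_diff parameter, the function names, and the toy data are assumptions; the exact scoring Datagrok applies is not described here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_cliffs(fingerprints, activities, cutoff, min_diff=0.0):
    """Return (i, j, similarity, activity_difference) for every cliff pair."""
    cliffs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        diff = abs(activities[i] - activities[j])
        if sim >= cutoff and diff > min_diff:
            cliffs.append((i, j, sim, diff))
    return cliffs


def max_cliff_per_molecule(cliffs, n):
    """Largest activity difference among the cliffs each molecule takes part in."""
    largest = [0.0] * n
    for i, j, _sim, diff in cliffs:
        largest[i] = max(largest[i], diff)
        largest[j] = max(largest[j], diff)
    return largest


# Toy data: rows 0-2 are structurally close, row 3 is unrelated.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {9, 10}]
activities = [5.0, 8.5, 5.2, 7.0]

cliffs = find_cliffs(fps, activities, cutoff=0.7, min_diff=1.0)
sizes = max_cliff_per_molecule(cliffs, len(fps))
```

With a cutoff of 0.7, rows 0 and 2 are similar but have nearly the same activity, so only the pair of rows 1 and 2 is reported as a cliff.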

To explore the molecule pairs:

  • Click a molecule in the source dataframe to zoom in on the scatterplot and focus on the pair that includes the selected molecule. Hover over molecule pairs and connecting lines to see summary information about them.
  • Click the line connecting molecules in the scatterplot to select a corresponding pair of molecules in the underlying dataframe and Activity Cliffs table. The reverse also applies: clicking a pair in the Activity Cliffs table updates the scatterplot and selects the corresponding rows in the underlying dataframe.

As you browse the dataset, the Context Panel updates with relevant information.

Matched molecular pairs

note

This feature is in Beta.

The Matched Molecular Pairs ("MMP") tool lets you explore chemical space and identify structural transformation rules that can be used to improve potency and ADMET properties during lead optimization. This tool automatically detects matched molecule pairs in your dataset and calculates the difference in property or activity values between them. The mean change in property or activity values across your dataset represents the expected size of the change when the transformation is applied to a molecule.
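
The "mean change" statistic can be illustrated with a short Python sketch. It assumes the matched pairs have already been detected and that each pair is described by the fragment that was replaced, the fragment that replaced it, and the property value of each molecule; the fragment labels, values, and the summarize_transformations helper are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_transformations(pairs):
    """Group matched pairs by transformation and average the property change.

    pairs: iterable of (from_fragment, to_fragment, value_before, value_after).
    """
    deltas = defaultdict(list)
    for frag_from, frag_to, value_before, value_after in pairs:
        deltas[(frag_from, frag_to)].append(value_after - value_before)
    return {
        transformation: {"count": len(changes), "mean_delta": mean(changes)}
        for transformation, changes in deltas.items()
    }


# Hypothetical matched pairs with an activity value for each molecule.
pairs = [
    ("H", "F", 6.0, 6.5),
    ("H", "F", 5.0, 6.0),
    ("H", "Cl", 6.0, 5.5),
]

summary = summarize_transformations(pairs)
```

In this toy example, replacing H with F occurs twice with a mean change of +0.75, so that is the expected effect of applying the same substitution to a new molecule; the count indicates how much evidence backs the estimate.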

The results of the MMP analysis are presented in a series of tables and visualizations, allowing you to:

  • View fragments and substitutions in your dataset
  • Analyze the effect of specific fragments on the chosen activity or property of a lead compound
  • Generate new molecules based on the transformations present in your dataset and view their predicted properties and activities.
How to use

To run MMP analysis:

  1. In the Top Menu, select Chem > Analyze > Matched Molecular Pairs... A dialog opens.
  2. In the dialog, select the table you want to analyze (Table), the column containing molecules within this table (Molecules), and the activity/property columns (Activity). Click OK. An MMP section is added to the view. It has four tabs:

The Transformations tab has two tables:

  • The upper table shows all fragment substitutions found in the dataset for the current molecule. It includes the frequency of each substitution and the corresponding change in the analyzed activity or property.
  • The lower table shows all pairs of molecules associated with the substitution from the upper table. It provides details about the analyzed activity or property for each pair of molecules.

QSAR and QSPR modeling

Datagrok lets you easily train, apply, and manage structure-based predictive models using these modeling methods:

  1. Classical models (such as XGBoost) that work on the calculated descriptors.
  2. Cheminformatics-specific models, such as chemprop.
Train

Train a model based on a measured response using calculated descriptors as features. The integrated model-building mechanism supports different backends and dozens of models with hundreds of hyperparameters. You can try it as a tutorial that walks through an illustrative virtual screening exercise.

Training

Apply
Augment

A simple yet efficient way to deploy models is through info panes. These panes provide predicted values that dynamically update as you interact with a chemical structure (for example, when you click, modify, or sketch a molecule). This approach gives you quick access to model predictions without leaving the dataset.

Augmenting

Chemical scripts

The Chem package comes with several scripts that can be used either directly or as examples for building custom chemical functions in languages such as Python (with RDKit) or R. These chemical functions can be integrated into larger scripts and workflows across the platform, enabling a variety of use cases such as data transformation, enrichment, calculations, building UI components, workflow automation, and more. Here's an example:

Gasteiger partial charges script

In this example, a Python script based on RDKit calculates and visualizes Gasteiger partial charges. When you run the script explicitly, Datagrok shows a dialog for sketching a query molecule and visualizes the results. In this case, however, the script is also tagged as a panel. This instructs Datagrok to show the results as an interactive UI element that updates dynamically for the current molecule.

To view the chemical scripts you've created or those shared with you, open the Scripts Gallery (Functions > Scripts) and filter by the tag #chem. You can search for individual scripts and use the Context Panel to view details, edit, run, manage, and perform other actions for the selected script.

note

For a full list of chemical scripts, along with details on their implementation and associated performance metrics, see Chemical scripts. To learn more about scripting, see Scripting.

Utilities

Datagrok offers multiple ways to transform and enrich your data. For example, you can link tables, extract values, or add metadata to annotate your dataset with experimental conditions or assay results. You can also use chemical scripts to execute operations on chemical data, including calculation of fingerprints and descriptors, toxicity prediction, and more.

Calculators

Molecular descriptors and fingerprints

Datagrok supports the calculation of different sets of descriptors and fingerprints:

  • Lipinski, Crippen, EState, EState VSA, Fragments, Graph, MolSurf, QED. See this reference article for additional details.
  • RDKFingerprint, MACCSKeys, AtomPair, TopologicalTorsion, Morgan/Circular. See this reference article for additional details.

For individual molecules, descriptors are calculated in real time and presented in the Descriptors info pane (Context Panel > Chemistry > Descriptors), which dynamically updates as you interact with a chemical structure (for example, when you click, modify, or sketch a molecule). You can also calculate descriptors or fingerprints for the entire column by choosing the corresponding option from the Chem > Calculate menu.

Molecule identifier conversions

Datagrok supports the conversion of various molecule identifiers, including proprietary identifiers, allowing you to work with multiple data sources and tools. For example, you can convert a SMILES string to an InChI and vice versa.

Supported data sources

actor           drugbank      lipidmaps      pubchem
atlas           drugcentral   mcule          pubchem_dotf
bindingdb       emolecules    metabolights   pubchem_tpharma
brenda          fdasrs        molport        recon
carotenoiddb    gtopdb        nih_ncc        rhea
chebi           hmdb          nikkaji        selleck
chembl          ibm           nmrshiftdb2    surechembl
chemicalbook    kegg_ligand   pdb            zinc
comptox         lincs         pharmgkb

For individual molecules, the conversion happens automatically as you interact with molecules, and the Context Panel shows all available identifiers in the Identifiers info pane (Context Panel > Structure > Identifiers). You can also convert the entire column by choosing the corresponding option from the Chem > Calculate menu.

developers

To run programmatically, use the #{x.ChemMapIdentifiers} function.

Curation

Datagrok supports chemical structure curation, including kekulization, normalization, reorganization, neutralization, tautomerization, and the selection of the main component.

How to use

To perform chemical structure curation:

  1. Navigate to Menu Ribbon > Chem > Transform > Curate.
  2. In the CurateChemStructures dialog, select from the available options and click OK. This action adds a new column containing curated structures.

Curation

Mutation

You can generate a dataset based on the preferred structure.

How to use

To perform chemical structure mutation:

  1. Navigate to Menu Ribbon > Chem > Transform > Mutate.
  2. In the Mutate dialog, draw or paste the desired structure and set other parameters, including the number of mutated molecules. Each mutation step can have randomized mutation mechanisms and places (select the Randomize checkbox).
  3. Click OK to execute. A new table with mutated structures opens.

Virtual synthesis

You can use the Chem: TwoComponentReaction function to apply specified chemical reactions to a pair of columns containing molecules in a virtual synthesis workflow. The output table contains a row for each product yielded by the reaction for the given inputs.

Reactions

How to use
  1. Open the Two Component Reaction dialog by executing the Chem: TwoComponentReaction function in the Console. This opens a parameter input dialog.
  2. In the dialog:
    1. Select the reactants to use.
    2. Enter the reaction in the field provided.
    3. Choose whether to combine the reactants from two sets, or sequentially, and whether to randomize, by checking or clearing the Matrix Expansion and Randomize checkboxes.
    4. Set other parameters, such as the seed and the maximum number of random reactions.
    5. Click OK to execute.
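
The difference between combining reactants from two sets and pairing them sequentially can be shown with a small Python sketch. It only enumerates reactant pairs and does not apply any chemistry; reactant_pairs and its parameters are hypothetical names, and the interpretation of the Matrix Expansion and Randomize options is an assumption based on the description above.

```python
import random
from itertools import product


def reactant_pairs(first, second, matrix_expansion=True, randomize=False,
                   max_pairs=None, seed=0):
    """Enumerate reactant pairs for a two-component reaction.

    matrix_expansion=True  -> every reactant in `first` with every one in `second`.
    matrix_expansion=False -> pair reactants row by row (sequentially).
    randomize=True         -> shuffle the pairs reproducibly using `seed`.
    max_pairs              -> optional cap on the number of pairs returned.
    """
    if matrix_expansion:
        pairs = list(product(first, second))
    else:
        pairs = list(zip(first, second))
    if randomize:
        random.Random(seed).shuffle(pairs)
    return pairs if max_pairs is None else pairs[:max_pairs]


# Placeholder reactant labels (not real structures).
amines = ["amine_1", "amine_2"]
acids = ["acid_1", "acid_2", "acid_3"]
```

With two amines and three acids, matrix expansion yields six pairs, while sequential pairing yields two. Fixing the seed makes a randomized run repeatable, which helps when you want to regenerate the same virtual library.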

Customizing and extending the platform

Datagrok is a highly flexible platform that can be tailored to meet your specific needs and requirements. With its comprehensive set of functions and scripting capabilities, you can customize and enhance any aspect of the platform to suit your chemical data needs.

For instance, you can add new data formats, apply custom models, and perform other operations on molecules. You can also add or change UI elements, create custom connectors, menus, context actions, and more. You can even develop entire applications on top of the platform or customize any existing open-source plugins.

Learn more about extending and customizing Datagrok, including this cheminformatics-specific section.

Resources