Cheminformatics concepts
Please see the 20-minute cheminformatics introduction video.
Wikipedia: Cheminformatics is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used, for example, in pharmaceutical companies and academic settings in the process of drug discovery. These methods can also be used in chemical and allied industries in various other forms.
But what does in silico actually mean? The term is contrasted with in vitro (performed in glassware, outside a living organism) and in vivo (performed in a living organism) experiments. In silico thus denotes an experiment performed computationally, since silicon is the main material of modern computer chips.
Cheminformatics is an in silico discipline that handles chemical entities for a number of purposes:
- generating possible chemical structures, either exhaustively or under synthetic rules;
- storing compound collections;
- facilitating chemists' work through visualization and reaction prediction;
- predicting physical and chemical properties (QSPR) and biological activities (QSAR) of compounds;
- predicting ADMET properties;
- virtual screening to identify the most potent drug candidates.
Storing chemical information
The most convenient molecular representation for humans is the molecular graph, and Datagrok displays molecules this way. To try it, open smiles.csv and double-click a molecular graph to modify it. Mathematically, a graph is a set of vertices and edges, and machines can process it in different ways.
To store complete information about a molecule, the MOL format is widely used. It contains the atoms (vertices) and bonds (edges) of the molecule together with associated information such as atom coordinates, charges, and isotopes. Multiple MOL records are stored in an SDF file, where additional data (e.g., experimental activity values) can be attached. MOL is convenient and is supported by the overwhelming majority of cheminformatics software.
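To make the atom and bond blocks concrete, here is a minimal sketch of reading a V2000 MOL block with fixed-width slicing. The ethanol record is hand-written for illustration, and `parse_mol_v2000` is a hypothetical helper, not part of Datagrok or any toolkit. It ignores charges, isotopes, and property lines, which a real parser must handle.

```python
# Hand-written V2000 MOL block for the heavy atoms of ethanol (C-C-O).
# Coordinates are illustrative, not an optimized geometry.
ETHANOL_MOL = """\
ethanol
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
"""


def parse_mol_v2000(mol_block):
    """Return (atoms, bonds) from a V2000 MOL block (hypothetical helper)."""
    lines = mol_block.splitlines()
    # Line 4 is the counts line: atom count in columns 1-3, bond count in 4-6.
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol follows a space.
        xyz = (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        atoms.append({"symbol": line[31:34].strip(), "xyz": xyz})

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Two 1-based atom indices and the bond order, 3 characters each.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


atoms, bonds = parse_mol_v2000(ETHANOL_MOL)
```

The atoms are the vertices and the bonds are the edges of the molecular graph described above. An SDF file repeats such blocks, each followed by data fields and a `$$$$` separator line.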
The most compact popular storage format is the SMILES string, which is generated according to special rules. Datagrok uses SMILES to restore MOL files for subsequent processing. For chemical reactions, a modified string format, SMIRKS, is used.
MOL and SMILES are arguably the best formats for storing a single molecule. However, some applications need to express underspecified chemical structures (e.g., fragments that do not correspond to any real molecule). This requires logical expressions (e.g., a carbon or oxygen atom, a double or aromatic bond), which the SMARTS format provides. SMARTS can be thought of as the cheminformatics counterpart of regular expressions. Datagrok uses SMARTS for finding structural alerts, performing substructure search, and R-group analysis.
Operating with chemical information
Although the molecular graph is an appropriate form for storage, it cannot be used directly in a wide range of cheminformatics applications, especially machine learning. For such applications, molecules are represented as a set of molecular descriptors or molecular fingerprints. Datagrok can generate different sets of descriptors and fingerprints. The purpose of these representations is to provide a vector of values describing the molecule, which satisfies the linear algebra requirements of most ML methods. Vector (linear) spaces based on molecular descriptors are called chemical spaces.
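As a rough illustration of how a graph becomes a fixed-length vector, the sketch below hashes short linear atom paths into a bit vector. This is a toy scheme written for this page, not the algorithm used by Datagrok or RDKit. The function name, the 64-bit length, and the path length limit are all assumptions. Bond orders are ignored for brevity.

```python
import zlib


def path_fingerprint(atoms, bonds, max_len=3, n_bits=64):
    """Toy path-based fingerprint.

    atoms: list of element symbols; bonds: list of 0-based (i, j) index pairs.
    Returns a list of n_bits zeros and ones.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    bits = [0] * n_bits

    def walk(path):
        symbols = [atoms[i] for i in path]
        # A path and its reverse are the same fragment: keep one spelling.
        key = min("-".join(symbols), "-".join(reversed(symbols)))
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        bits[zlib.crc32(key.encode()) % n_bits] = 1
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return bits


# Ethanol heavy atoms written in two different atom orders.
fp_1 = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_2 = path_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Because each path is reduced to a canonical string before hashing, the two atom orderings give the same vector, which is the property that lets fingerprints serve as coordinates in a chemical space.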
Descriptors are frequently used to find similar chemical structures, which is the basis of similarity search and diversity search. In combination with clustering and self-organizing maps, methods such as stochastic proximity embedding reduce the dimensionality of these vectors and isolate the most significant features of the molecule. This makes it possible to visualize the chemical space as a 2D map.
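The two searches can be sketched with the Tanimoto coefficient, a common similarity measure for fingerprints (the text above does not name a specific metric, so this choice is an assumption). Fingerprints are represented here as sets of on-bit positions. The library contents and the helper names are made up for illustration, and the diversity picker is a simple greedy MaxMin variant.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs at or above threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [h for h in hits if h[1] >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)


def diversity_pick(library, n):
    """Greedy MaxMin: repeatedly add the entry farthest from those picked."""
    names = sorted(library)
    picked = [names[0]]  # arbitrary but deterministic seed
    while len(picked) < min(n, len(names)):
        def min_distance(name):
            return min(1.0 - tanimoto(library[name], library[p]) for p in picked)
        candidates = [nm for nm in names if nm not in picked]
        picked.append(max(candidates, key=min_distance))
    return picked


# Hypothetical fingerprints: each set lists the positions of the on-bits.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {7, 8, 9},
    "mol_d": {1, 2, 3, 4, 6},
}
hits = similarity_search({1, 2, 3, 4}, library, threshold=0.5)
diverse = diversity_pick(library, 2)
```

Similarity search returns the neighbors of a query, while diversity search does the opposite and selects entries that are as far apart as possible.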
Pharmaceutical research makes wide use of cheminformatics methods for exploratory analysis of chemical datasets and for subsequent modeling studies. These datasets are usually accompanied by experimental values (e.g., the measured biological activity of a compound). One of the most common tasks is the evaluation of structure-activity relationships, which are essential in drug development because they contribute to hit compound identification and lead compound optimization. (Q)SAR studies are performed to find possible leads in screening datasets. In Datagrok, solubility prediction and activity cliffs analysis are currently implemented.
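In its simplest quantitative form, a structure-activity relationship is a regression of activity on descriptors. The sketch below fits a one-descriptor linear model by ordinary least squares. The descriptor and activity values are invented for illustration and do not come from any dataset; real QSAR models use many descriptors and require validation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Made-up training data: one descriptor value and one activity per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(descriptor, activity)


def predict(x):
    """Predicted activity for a new descriptor value."""
    return slope * x + intercept
```

With the fitted coefficients, `predict` estimates the activity of an untested compound from its descriptor value, which is how such models are used to rank screening candidates.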
All of the methods described above are implemented in different analysis pipelines, and they assume that the descriptors describe a real molecule accurately. Errors in the data lead to biased descriptors, wrong interpretations of modeling outputs, and ultimately to results that are irrelevant. The most sensitive cases are duplicated vectors in the training set and errors caused by incorrect structure representation. For this reason, curation of chemical data is usually integrated into the analysis pipeline.
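One curation step mentioned above, detecting duplicated vectors in a training set, can be sketched as follows. The record layout, the compound IDs, and the 0.5 activity tolerance are assumptions made for this example. The function only reports duplicates; deciding whether to merge or drop them is left to the analyst.

```python
from collections import defaultdict


def find_duplicate_vectors(records, tolerance=0.5):
    """Group records with identical descriptor vectors.

    records: iterable of (record_id, vector, activity).
    A group is marked conflicting when its activities differ by more than
    tolerance, which suggests a data error rather than a harmless repeat.
    """
    groups = defaultdict(list)
    for rec_id, vector, activity in records:
        groups[tuple(vector)].append((rec_id, activity))

    report = []
    for members in groups.values():
        if len(members) > 1:
            activities = [act for _, act in members]
            report.append({
                "ids": [rec_id for rec_id, _ in members],
                "conflicting": max(activities) - min(activities) > tolerance,
            })
    return report


# Hypothetical training set: (id, descriptor vector, measured activity).
records = [
    ("cpd-1", (1, 0, 1), 6.2),
    ("cpd-2", (0, 1, 1), 5.0),
    ("cpd-3", (1, 0, 1), 6.3),
    ("cpd-4", (0, 0, 1), 7.1),
    ("cpd-5", (0, 0, 1), 4.9),
]
report = find_duplicate_vectors(records)
```

Here `cpd-1` and `cpd-3` are consistent repeats, while `cpd-4` and `cpd-5` share a vector but disagree strongly in activity and would need manual review before modeling.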
Freely available tools and public databases
A number of freely available cheminformatics toolkits exist for investigators and developers, including RDKit, CDK, and Open Babel. RDKit is widely used at Datagrok in the form of Python, C++, and JS code.
To give its users access to up-to-date data, Datagrok is connected to PubChem, ChEMBL, and other databases that contain continuously updated chemical structure data associated with experimental biological values.
Cheminformatics limitations
Though it might seem that cheminformatics covers all molecular needs, it has its limits. Its applicability is limited to small molecules and some types of peptides. It shows its real power when combined with other in silico approaches, including but not limited to:
- docking and molecular dynamics
- systems biology and pharmacology
- bioinformatics
- proteomics.
See also: