Cheminformatics
These programming exercises introduce developers to the cheminformatics capabilities of the Datagrok platform. They build on the knowledge you gained in the earlier exercises.
Basic exercises in cheminformatics
Exercise 1: Search for most common structures
You will learn: How to employ functions from external packages in your own package.
Prerequisites: "Cheminformatics".
Statement of the problem. Write a function that reads a file containing SMILES, determines the maximum common substructure (MCS) of the molecules, and computes the similarity score between each molecule and the MCS.
Input data. Files > App Data > Chem > sars_small.csv
Solution, step-by-step.
-
Let's call our function findSimilarToMCS and place its definition in ./src/package.ts inside our package. The function takes a single input, a dataframe df. For simplicity, we assume that the SMILES column is df.col('smiles'):
//name: findSimilarToMCS
//input: dataframe df
export async function findSimilarToMCS(df: DG.DataFrame): Promise<void> {
  // ... your code goes here
}
-
Use the asynchronous function FindMCS from the Chem package. Since we're calling a function from an external package, we should use grok.functions.call, passing the dataframe df from step 1:
const mcsValue = await grok.functions.call('Chem:FindMCS', {'smiles': 'smiles', 'df': df, 'returnSmarts': false});
-
Having obtained the string mcsValue, create a new column in df whose cells are all filled with this value:
  - Create an Array of the appropriate length, filled with mcsValue.
  - Feed this array to the static method DG.Column.fromList() to get the desired mcsCol object.
  - Assign the semantic type Molecule to the newly created column by setting the col.semType property. Similarly, associate the Molecule cell renderer with the column using the col.setTag(...) method.
-
To compute similarity scores, we can call the getSimilarities() function of the Chem package, which takes as its parameters the initial SMILES column and mcsValue. The function can be invoked as described in step 2.
-
The output of step 4 is a new dataframe scoresDf, whose 0-th column contains the score values. This column, scoresCol, can be retrieved with the byIndex() method of the scoresDf.columns object.
-
Finally, insert the columns mcsCol and scoresCol into the dataframe, next to the initial SMILES column. The df.columns.insert() method can help with this, provided we choose the insertion index carefully.
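The array-building part of step 3 and the index arithmetic of step 6 can be sketched in plain TypeScript, independent of the Datagrok API. This is only an illustration under stated assumptions: plain string arrays stand in for column cells and column names, the helper names constantColumnValues and insertAfter are hypothetical (they are not part of any Datagrok package), and the SMILES string is an arbitrary example value.

```typescript
// Hypothetical helpers; plain arrays stand in for Datagrok columns.

/** Builds an array of `rowCount` cells, each holding the same value. */
function constantColumnValues(value: string, rowCount: number): string[] {
  return new Array<string>(rowCount).fill(value);
}

/**
 * Returns a copy of `names` with `newNames` placed right after `anchor`.
 * Throws if `anchor` is not present.
 */
function insertAfter(names: string[], anchor: string, newNames: string[]): string[] {
  const anchorIndex = names.indexOf(anchor);
  if (anchorIndex < 0)
    throw new Error(`Column '${anchor}' not found`);
  const position = anchorIndex + 1; // insertion index: right after the anchor
  return [...names.slice(0, position), ...newNames, ...names.slice(position)];
}

// Example values only: a 3-row table whose MCS is assumed to be benzene.
const mcsCells = constantColumnValues('c1ccccc1', 3);
const columnOrder = insertAfter(['id', 'smiles', 'activity'], 'smiles', ['mcs', 'scores']);
console.log(mcsCells);
console.log(columnOrder);
```

In the package itself, the same position would serve as the index given to df.columns.insert(). If the two columns are inserted one at a time, the second insertion needs the next index for mcsCol to end up before scoresCol.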
Exercise 3: Train Model to Predict Activity
You will learn: How to train a model inside a package and use it to predict the activity of molecules.
Prerequisites: "Molecular fingerprints", "Cheminformatics".
-
Create a package with the name <yourFirstName>-cheminformatics.
-
Add a new function:
//name: TrainAndPredict
//input: dataframe train
//input: dataframe test
export function TrainAndPredict(train: DG.DataFrame, test: DG.DataFrame) {
  // your code here
}
Here, train and test are the dataframes used for training and prediction, respectively.
-
Using grok.chem.descriptors, compute fingerprints for all molecules.
-
Use grok.ml.trainModel to train a model on the fingerprints that predicts the activity of a molecule. You can use the example dataset.
-
Using grok.ml.applyModel, apply the model to the test and train datasets, then check the accuracy of the model.
-
Using grok.shell.addTableView(datasetName), display the test dataset.
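For the accuracy check mentioned in the steps above, one simple option is to compare predicted and actual activity values directly. The sketch below assumes the values have already been extracted from the dataframes into plain arrays. The helper names classificationAccuracy and meanAbsoluteError are hypothetical, and which metric fits depends on whether the activity column is categorical or numeric.

```typescript
// Hypothetical helpers; plain arrays stand in for activity columns.

/** Fraction of positions where the predicted label equals the actual label. */
function classificationAccuracy(actual: string[], predicted: string[]): number {
  if (actual.length !== predicted.length || actual.length === 0)
    throw new Error('Arrays must be non-empty and of equal length');
  let hits = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] === predicted[i])
      hits++;
  }
  return hits / actual.length;
}

/** Mean absolute difference between predicted and actual numeric values. */
function meanAbsoluteError(actual: number[], predicted: number[]): number {
  if (actual.length !== predicted.length || actual.length === 0)
    throw new Error('Arrays must be non-empty and of equal length');
  let total = 0;
  for (let i = 0; i < actual.length; i++)
    total += Math.abs(actual[i] - predicted[i]);
  return total / actual.length;
}

// Made-up example values, not results from a real model.
const acc = classificationAccuracy(
  ['active', 'inactive', 'active', 'active'],
  ['active', 'active', 'active', 'inactive'],
);
const mae = meanAbsoluteError([1, 2, 4], [1.5, 2, 3]);
console.log(acc, mae);
```

Computing the metric on both the train and the test predictions makes overfitting visible: a model that scores much better on the training data than on the test data has likely memorized it.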