Define semantic type detectors
Datagrok helps you get the most out of your data by encoding its meaning into semantic types. In addition to the types that are automatically detected by the platform, you can define your own semantic types. This can serve multiple purposes: from custom rendering and viewers to info panels and predictive models.
Function annotation
Semantic types detectors have a tag semTypeDetector
, take a single column
as input, and return either a string
containing the semantic type or null
. Their names typically start with
the detect
prefix, e.g., detectNucleotides
or detectRDSmiles
. These functions should live in detectors.js
, a special file inside
your package. There you simply define a class named <package_name>PackageDetectors
that
subclasses DG.Package
and add one or more detectors:
class SequencePackageDetectors extends DG.Package {
//tags: semTypeDetector
//input: column col
//output: string semType
detectNucleotides(col) {
if (col.name.startsWith('nuc')) {
col.semType = 'nucleotides';
return col.semType;
}
return null;
}
}
Notice how the column properties are used. Such properties as col.name
help determine a semantic type,
but col.semType
actually assigns the detected type to the column.
To get a template for a detector function, run this command from your package directory:
grok add detector <semantic-type-name>
A column can have only one semantic type. In cases when several detectors match a column, its semantic type depends on the order in which the detectors were triggered. Standard platform detectors do not necessarily get a preference.
Advanced checks
Semantic type detectors are uploaded separately from the rest of a package
(therefore, you might not see webpack warnings about syntax errors in
detectors.js
). Datagrok calls detectors each time the user opens a table. To quickly inspect the data, these functions
have to be lightweight, simple, and efficient. Often checking the data type (col.type
) and applying regular
expressions to the column name (col.name
) will be enough for such tests. However, there are ways to carry out a more
advanced check. For instance, the following example makes use of column statistics:
class ViewersPackageDetectors extends DG.Package {
//tags: semTypeDetector
//input: column col
//output: string semType
detectMagnitude(col) {
if ((col.type === DG.TYPE.FLOAT || col.type === DG.TYPE.INT) &&
(0 < col.min && col.max < 10) && col.name.toLowerCase() === 'magnitude') {
col.semType = 'Magnitude';
return col.semType;
}
return null;
}
}
Apart from col.min
and col.max
, there are other
descriptive statistics
calculated for a column.
If the columns you are working with contain a lot of unique categories, use a special sampling
method DG.Detector.sampleCategories()
to run your checks on a random subset of column values (see an example in the
public package
NglViewer).
Empty values deserve special attention: make sure you don't match them with your semantic type, for example, empty strings should not match your regular expressions. Otherwise, this type might be assigned to a column consisting solely of nulls, which won't add any helpful insights to the data profile.
Detectors test
Every Semantic type detector has its own test in Datagrok, so when you create a detector you have to be aware of its test. Usually, Datagrok uses
specified test data for detector tests, but sometimes this data doesn't fit the detector. In this case, you can easily skip tests for your detector by using the meta.skipTest
tag.
//tags: semTypeDetector
//input: column col
//output: string semType
//meta.skipTest: #2596, Fix for test data in the utils library
detectPdb(col) {
if (DG.Detector.sampleCategories(col,
// (s) => s.includes('COMPND') && s.includes('ATOM') && s.includes('END'), 1)
(s) => s.match(/^COMPND/m) && s.match(/^END/m) &&
(s.match(/^ATOM/m) || s.match(/^HETATM/m)),
)) {
col.meta.units = 'pdb';
return 'Molecule3D';
} else if (DG.Detector.sampleCategories(col,
(s) => s.match(/^MODEL/m) && s.match(/^ENDMDL/m) &&
(s.match(/^ATOM/m) || s.match(/^HETATM/m)),
1)
) {
col.meta.units = 'pdbqt';
return 'Molecule3D';
}
return null;
}
Also, if you have a specified dataset in the detector package, you can use it to test the detector. You need to set //meta.testData
to define the dataset for the detector.
There should be only one column that fits the detector. If you want to ensure that the detector chooses it correctly you can use //meta.testDataColumnName
to set the column's name for the test.
//tags: semTypeDetector
//input: column col
//output: string semType
//meta.testData: pdb_data.csv
//meta.testDataColumnName: Molecule3D
detectPdb(col) {
if (DG.Detector.sampleCategories(col,
// (s) => s.includes('COMPND') && s.includes('ATOM') && s.includes('END'), 1)
(s) => s.match(/^COMPND/m) && s.match(/^END/m) &&
(s.match(/^ATOM/m) || s.match(/^HETATM/m)),
)) {
col.meta.units = 'pdb';
return 'Molecule3D';
} else if (DG.Detector.sampleCategories(col,
(s) => s.match(/^MODEL/m) && s.match(/^ENDMDL/m) &&
(s.match(/^ATOM/m) || s.match(/^HETATM/m)),
1)
) {
col.meta.units = 'pdbqt';
return 'Molecule3D';
}
return null;
}
See also: