Skip to main content

Dataframe

DataFrame is a tabular structure with strongly-typed columns of various types. A dataframe class DG.DataFrame is used in virtually any Datagrok extension or application. It operates via a columnar in-memory data engine which Datagrok implemented from scratch to support highly efficient operation with data in a modern browser together with fast in-browser data visualizations.

Dataframe JavaScript API

Dataframe stores data as list of columns. Constructing, modifying and efficiently accessing data points are embodied in both DG.Column and DG.DataFrame classes. Event handling, visual aspects of working with dataframes, fast column selection, handy construction methods and row-based access are provided in DG.DataFrame. Instances of DG.ColumnList, DG.RowList, DG.Row and DG.Cell are used as related properties or functions return values of DG.Column and DG.DataFrame.

Dataframe design

A Datagrok dataframe reminds of functionally similar structures in Python and R.

In comparison, Datagrok implementation is considerably optimized. While it isn't possible to control the browser entirely, Datagrok in-memory engine is designed in such way that all data-related operations, such as aggregations or statistical computations, are performed efficiently on modern machines.

Here are a few of unique dataframe design features:

  • Built to allocate little memory space, utilizes adaptive bit storage
  • Operates on raw data instead of JavaScript wrapping objects
  • Utilizes manual memory management with the column-based layout
  • Doesn't have blocking structures, works well in multithreaded environments
  • Made to work entirely in the browser, but same dataframe code also runs on Datagrok servers, e.g. for serialization
  • Uses custom serialization codecs for efficiency

Datagrok visualizations sit on top of the dataframe, that's one of the key reasons for their high performance.

DG.Column

Columns support the types specified in DG.COLUMN_TYPE enum: STRING, INT, FLOAT , BOOL, DATE_TIME, BIG_INT, QNUM, DATA_FRAME and OBJECT. Find the details of handling these types further.

Column properties

Every column has:

  • a .name: set it at construction, change at runtime
  • a .length: it's a read-only property
  • a .type: one of DG.COLUMN_TYPE, possible to change later via DG.DataFrame class
  • a parent .dataFrame, if the column is a part of some dataframe
  • .temp and .tags: provide access to special proxy classes for named temporary values and named tags which can be associated with any column

.tags play key role in platform extensibility, allowing to introduce custom behavior and new data types.

A column which is already constructed, either by an initialization from a list, or being part of some dataframe from its creation, isn't a subject to changing its length, if taken standalone. This is due to its memory-optimized nature. However, the columns length is changed with adding rows to the embodying dataframe.

Semantic type

A property .semType is a string value representing a tag which associates a column data with a logical type.

An underlying raw data type of a molecule or a peptide is usually just a string. Semantic typing allows expanding the platform with such logical types recognition and understanding with a help of 3-rd party Datagrok packages. For example, Chem package adds visualizing and handling molecules, including detecting that a string column is of type Molecule, computing molecular fingerprints, searching for substructures, ranking by similarity.

The property .semType is a getter to a tag named 'quality'. This distinct getter is there due to a key role of this tag.

Learn more about semantic types in this article: Link.

Construct a column

  • The most common way is to use a method .fromList, explicitly specifying a type and a column name: let col = DG.Column.fromList(DG.COLUMN_TYPE.INT, 'Column Name', [1, 2, 3])
  • The method .fromStrings recognizes and automatically assigns a type: let col = DG.Column.fromStrings('Column Name', ['3.14', '2.71']); // col.type === DG.COLUMN_TYPES.FLOAT
  • To create a column with NULL values of a pre-specified length, use .fromType or concrete types shortcuts: let col = DG.Column.fromType(DG.COLUMN_TYPE.INT, 'Name', 3); // col.get(0) === DG.INT_NULLlet col = DG.Column.string('Name', 5); // col.get(2) === ""

The column, once constructed, may later be added to a dataframe.

Manipulate column values

Access and modify column values

  • A method .get is passed an index i to return an i-th value: const value = column.get(idx)
  • A method .set sets i-th value to x: column.set(i, x)

A pair of methods .getString/.setString work similarly, but with formatting and parsing:

  • .getString returns a value converted to a string taking into account a column's tag format
  • .setString attempts to set an i-th value by converting a provided string to the corresponding strongly-typed value, returns true if a text was successfully parsed and set, otherwise false
const table = grok.data.demo.demog();
const column = table.columns.byName('weight');
column.tags.format = '#.0000';
grok.shell.info(column.getString(217)); // displays '108.7208'
grok.shell.info(column.setString(15, '3.1415')); // displays 'true'
grok.shell.info(column.setString(16, 'non-number')); // displays 'false'

Learn more about supported formats in this article: Link.

Refresh the state after modifying the column

Some column associated structures, such as categories, won't be automatically updated after the column data modification with .setString, .set and similar. This is intentional for performance reasons. After the modification is meant to be completed, it is possible to refresh the column col representation, including categories:

col.compact();

This call compacts the internal column representation. Currently, it only affects string columns where values were modified

Initialize values with a function

It's handy to set all column values in a single batch. This is possible with .init function:

let df = DG.DataFrame.fromCsv(
`x, y, s
1, 2, Point A
3, 5, Point B
`);
df.col('x').init(2); // df.col('x').get(1) === 2
df.col('s').init('No comment'); // df.col('s').get(0) === 'No comment'
df.columns.addNewInt('counter').init((i) => i * 3); // df.col('counter').get(1) === 3
grok.shell.addTableView(df);

Access performance

To learn the typical times it takes to run various column access patterns, run this example. In summary, it is advised to explicitly get a column and its length as separate variables before looping through:

const t = grok.data.demo.demog(100000);
const column = t.columns.byName('age');
const rowCount = t.rowCount;
let sum = 0;
for (let i = 0; i < rowCount; i++)
sum += column.get(i);

Doing a .byName or a t.rowCount call as part of the loop shall incur up to 20x overhead. A column.values() iterator is also available, it is 2-3 times slower than this snippet.

If the fastest access is required for numerical column data, which usually happens in computing new values atop a column, accessing data with a result of calling .getRawData() is advised:

const table = grok.data.demo.demog(100000);
const column = table.columns.byName('age');
const array = column.getRawData();
const rowCount = column.length;
let sum = 0;
for (let i = 0; i < rowCount; i++)
sum += array[i];

It is 4-6 times faster than a snippet from above with explicit column and rowCount.

.getRawData returns a Float32Array for DG.COLUMN_TYPE.FLOAT, Int32Array for DG.COLUMN_TYPE.INT.

Column data types

Special values

In popular languages like JavaScript, Python and R, special values representing missing values, unset values, or values for the undefined results (such as not valid numbers), are known as None , NaN (Not A Number), null, or undefined. Dataframe's type system implements special handling of such values in a way great for performance yet still allowing for these kinds of values to be present.

NULL values for numeric types

For performance reasons, Datagrok type system implements an equivalent of an empty value for numeric types DG.COLUMN_TYPE.INT and DG.COLUMN_TYPE.FLOAT. There are special constants DG.INT_NULL = -2147483648 and DG.FLOAT_NULL = 2.6789344063684636e-34, respectively.

To set a value as a Datagrok NULL, pass either a JavaScript null or a corresponding constant:

let col = DG.Column.fromList(DG.COLUMN_TYPE.INT, 'Name', [314, null, DG.INT_NULL, 143]);
col.set(0, null);
grok.shell.info(col.get(0)); // shows '-2147483648'
grok.shell.info(col.get(1)); // shows '-2147483648'
grok.shell.info(col.get(2)); // shows '-2147483648'

When checking if a value is a Datagrok NULL, be cautious of using === null, as the value isn't a JavaScript null. Instead, use a DG.Column's method .isNone(i) to check if the i-th element is Datagrok NULL. Continuing with the previous example:

grok.shell.info(col.isNone(2)); // shows 'true'
grok.shell.info(col.isNone(3)); // shows 'false'
NULL values in table views
  • In the table view, Datagrok NULL values are seen as empty cells
  • Entering an empty string in a numeric column makes a cell valued to a Datagrok NULL
  • If an empty string is passed to a .setString, a Datagrok NULL value shall be assigned
NULL values for strings

An empty string is an equivalent of a Datagrok NULL:

let col = DG.Column.fromList(DG.COLUMN_TYPE.STRING, 'Name', [null, '']);
grok.shell.info(col.isNone(0)); // shows 'true'
grok.shell.info(col.isNone(1)); // shows 'true'
undefined and NaN values
  • Passing undefined as a column item value is equivalent to passing a null
  • NaN in JavaScript represents a value which is not a valid number: this value isn't currently supported

String, integer, float, boolean

Enum values: DG.COLUMN_TYPE.STRING, DG.COLUMN_TYPE.INT, DG.COLUMN_TYPE.FLOAT , DG.COLUMN_TYPE.BOOL

These are regular JavaScript types. The only difference is handling NULL-values as described above.

Datetime

Enum value: DG.COLUMN_TYPE.DATE_TIME

In future releases, DateTime type becomes a wrapper around a dayJs value.

Bigint

Enum value: DG.COLUMN_TYPE.BIG_INT

The type for working with integers that do not fit into 53 bits.

BIG_INT won't be rendered by a grid viewer and won't become part of aggregations. It is introduced for compatibility with BIG_INT types in databases.

Dataframe

Enum value: DG.COLUMN_TYPE.DATA_FRAME

This example demonstrates how to store dataframes as column values.

Qualified number

Enum value: DG.COLUMN_TYPE.QNUM

Qualified numbers, or QNums, are typically used to represent measurements when the exact value is not known, but it is known that it is either less or greater than some value. The examples of such values are <3.5 or >5E-6.

To keep the performance of this data type on par with INTEGER and FLOAT, QNUM was based on the 64-bit floating point number where two least significant bits of mantissa are reserved for the qualifier. This storage structure comes with ability to efficiently store the numbers and perform arithmetic operations on them without having to incur costly packing/unpacking. The qualifier part is valued to one of: 1 (LESS), 2 (EXACT), or 3 (GREATER). This way, QNums are compared using regular floating point number comparison.

There is no special internal data type, as the values are 64-bit IEEE754 floats. In most cases the isn't need to check whether the value is qualified or not, since the result for the most operations (rendering, using for visualization, arithmetic operations) will be the same. However, in cases it does matter the programmer has to keep track of whether the values are qualified, and pack/unpack accordingly. This is achieved with DG.Qnum class containing helper methods:

let col = DG.Column.qnum('col', 3);
col.set(0, DG.Qnum.greater(5));
col.set(1, DG.Qnum.exact(5));
col.set(1, DG.Qnum.less(5));
grok.shell.info(DG.Qnum.getQ(col.get(0)) == DG.QNUM_GREATER); // shows `true`

Object

Enum value: DG.COLUMN_TYPE.OBJECT

This may be any JavaScript object. Pass it as usual.

Statistics

The properties .min and .max of Column return and cache the minimum and maximum numeric values. Using the internal version of the column, the caches for .min and for .max are automatically invalidated whenever the column is modified.

Computing popular statistics, such as average .avg, median .med or standard deviation .stdev, is more efficient when done altogether. If that's the case you need, it's possible to get all such statistics using a ., which returns an instance of Stats class.

Run Stats example: Link.

Numerical and categorical columns

Currently, columns of types BOOL and STRING are considered as categorical, s.t. they represent values from a discrete set. Columns of types INT and FLOAT are considered as numerical, s.t. they represent values from a continuous set. Later, this will be adjusted s.t. whether column is categorical or numerical will be automatically detected based on its content analysis.

This property mostly affects the way columns are treated in the UI. For instance, this property affects how color coding works.

Work with categories

For a string column, the column collection .categories returns a list of categories in an alphabetical sort. For instance, for a column of strings ['a', 'b', 'b', 'c'] the categories list will be ['a', 'b', 'c'].

It is possible to introduce a custom category order, as this example demonstrates. That is useful if the alphabetical order does not reflect a natural order, for example in ['Low', 'Medium', 'High'].

Versioning

A column increments an integer version number whenever its contents are changed. The .version property returning this number is useful when invalidating cached computation results stored as columns or dataframes. Check this example to see how column versioning works.

DG.DataFrame

Construct

Dataframes may be obtained through the JavaScript or TypeScript code in various ways:

  • a new dataframe constructed from predefined data: Link
  • a new dataframe from precreated Datagrok columns: Link
  • a table already being rendered by a table view
  • a dataframe constructed from a file in a file share
  • a CSV file uploaded to a browser
  • a dataframe returned by a script
  • as calculated on the flight for aggregations

Construct from in-place content

A variety of in-place construction methods are available:

  • Construct from array of explicitly created columns, all same length: link
  • Specify number of rows in a dataframe at creation and later adding columns with this number of items: link
  • Create from a string containing a CSV: link
  • Create from a string containing a JSON: link
  • Create from JavaScript objects: link
  • Create from JavaScript typed arrays: link

Load a demo dataset

Datagrok comes with a hundred demo datasets available as a folder in the platform interface. It's possible to load these datasets programmatically using Datagrok API methods:

  • let table = grok.data.demo.demog() loads the first 10'000 rows of the demographic dataset we used a lot in this article. It's possible to specify a desired number of rows as a parameter, and use one of .biosensor/.wells/.geo functions instead of .demog to load corresponding data
  • let data = grok.data.demo.randomWalk(rows, cols) generates random walk data
  • await table = grok.data.getDemoTable('geo/earthquakes.csv') allows loading any demo dataset from the specified demo datasets file share by its path (the method is asynchronous)

Access

Access columns

Dataframes' columns are accessed and manipulated as a collection by an instance of DG.ColumnList class called .columns. Calling .columns methods is often a starting point to doing anything with a given dataframe:

let d = grok.data.demo.demog();
let len = d.columns.length;
ageColumn = d.columns.byName('age');
ageColumn = d.columns['age'];
ageColumn = d.columns.byIndex(3);
ageColumn = d.columns[3];
ageColumn = d.columns.age;

Note that access by name is case-insensitive.

There are also two shortcuts to get the column by name:

ageColumn = d.col('age');    // won't throw if column isn't found
ageColumn = d.getCol('age'); // will throw if column isn't found

Technically, .columns object is an instance of a special JavaScript wrapper around an instance of ColumnList. While the ColumnList class itself provides for the access methods such as .byIndex or .length, this wrapper adds support for square brackets, access by explicit column name, like .d.columns.age, and other syntactic sugaring of JavaScript such as iterating columns with a for (... of ...) loop:

let df = grok.data.demo.demog();
for (let col of df.columns) {
grok.shell.info(col.name);
}

In addition to regular access to columns by index and name, there's a group of methods covered in this example for retrieving columns Iterable by a specific property:

  • With pre-specified tags:

    let demog = grok.data.demo.demog();
    demog.getCol('race').setTag('tag1', 'value1').setTag('tag2', 'value2');
    // undefined or null means that any value passes
    for (let column of demog.columns.byTags({'tag1': 'value1', 'tag2': undefined}))
    grok.shell.info(column.name);
  • With categorical or numerical columns:

    let demog = grok.data.demo.demog();
    for (let column of demog.columns.categorical) grok.shell.info(column.name);
    for (let column of demog.columns.numerical) grok.shell.info(column.name);

Access rows

As DG.ColumnList provides for column access and manipulation through a .columns property, its counterpart DG.RowList with an instance called .rows allows for rows access and manipulation.

As dataframe is columnar-optimized, it is preferred to work with it in columns-then-rows rather than rows-then-columns fashion. Typically, accessing a dataframe value is based on a column and then row, simply by obtaining a reference to a relevant column and then accessing its elements by their indices. This won't create any intermediate objects besides the ones already existing as part of the dataframe.

If a row-by-row access is desired, that is possible with additional rows materialization. This brings memory and performance overheads, though it may be a convenient tool in cases where dataframes are small — no more than a 100 of rows. An example of iterating through a dataframe row-by-row:

const df = grok.data.demo.demog(5);
for (let row of df.rows) {
grok.shell.info(row.idx);
}

To access rows' values in such iteration, use a collection of DG.Cell named .cells, which is another wrapper around a values stored at a columns' offset:

const df = grok.data.demo.demog(10);
for (const row of df.rows) {
for (const cell of row.cells) {
grok.shell.info(cell.value);
}
}

Manipulate

Manipulate columns

To add, remove or replace columns in a dataframe:

  • for an existing instance column of Column:
    • .add(column) adds it as a last column of the dataframe
    • .insert(column, index) adds it before the column position which is currently at position index, starting count from 0. The default value of index is null, which corresponds to adding the column at the end of the columns list
  • .addNew adds a new empty column at the end of the column list
  • .addNew<TypeName>(name) (.addNewInt, .addNewString etc) adds a new empty column of a type<TypeName> at the end of the column list
  • .addNewVirtual adds a new virtual column at the end of the column list
  • .replace(oldColumn, newColumn) substitutes inside a dataframe an oldColumn passed as an object with a newColumn . The column remains available by oldColumn, but it is no longer part of the dataframe

Examples:

  • run "Add columns": Link
  • run "Manipulate columns": Link

Add a column by a formula

The .columns property of DataFrame provides for a method to create a column by a mathematical formula, specified with a string, which may also involve any function registered within the platform. For example, in a dataframe df with columns X and Y it's possible to add a new column Z, specified as follows:

df.columns.addNewCalculated('Z', 'Sin({X}+${Y})/2');

The column type shall be deduced automatically, or may be specified explicitly as one of this method arguments.

Run "Add calculated columns" example: Link.

Manipulate rows

While it is not generally advised to access the dataframe values row-by-row rather than column-then-offset, it is often useful to expand or shrink dataframes row-by-row. The .rows property enables a set of methods for this:

  • addNew([value1, ..., valueN]) appends a row to the end of the dataframe, filling it with values specified in the passed array, there should be as many values as there are columns. If a null value is passed as an array, an empty row shall be created with the values of Datagrok None

  • .setValues(rowIdx, [value1, ..., valueN]) will set values in the row at the specified offset according to the values in the array

  • .insertAt(idx, count) inserts count new empty rows before a row idx

  • .removeAt(idx, count) removes count rows, starting from a row idx and further

Extend

Join

Aggregate

Pivot

Virtual columns

Consider a person's weight, height and a derived value of a mass index BMI, which is computed as weight / height^2. It is handy to access a BMI value as if it is stored in the dataframe along with other column values. However, it would not be practical to pre-compute these values and actually store in the dataframe, and also be inconvenient to maintain them while the dataframe is being updated.

Virtual columns is a tool to access such computed values of arbitrary type as if they are stored in the dataframe, without physically storing them. For their construction, a callback function taking a row index and returning a scalar value or an object is passed. The platform would call this function whenever a table cell is accessed (for instance, when rendering a grid).

Virtual columns for scalar types

In the example below we are expanding a demog table with the two virtual columns:

  • idx, which only depends on a row's index
  • BMI, which depends on two values of the row

Property .isVirtual indicates if a column is virtual.

let table = grok.data.demo.demog();
table.columns.addNewVirtual('idx',
(i) => i + 1, DG.TYPE.INT);
table.columns.addNewVirtual('BMI',
(i) => table.row(i).weight / Math.pow(table.row(i).height / 100, 2), DG.COLUMN_TYPE.FLOAT);
grok.shell.add(table);
grok.shell.info(table.columns['idx'].isVirtual); // shows 'true`

Access virtual columns in a most standard way:

console.shell.info(table.get('idx', 11)); // displays '12'

Virtual columns for objects

In addition to scalar types, it is possible to store JavaScript objects in columns of type Object. However, if such object is a proxy, domain-specific representation of row's data, most often it is not desired to physically store such objects in the dataframe. A virtual column with a type of DG.COLUMN_TYPE.OBJECT (it is a default value for a type in .addNewVirtual) with a callback function returning a new object is a convenient alternative to an OBJECT -typed Column. Origin data still resides in the most efficient way in the compressed columnar format, at the same time developers have an option to access it using custom, domain-specific objects which get constructed on the fly.

In the example below we expand the demog dataframe with objects of a class PersonalData:

class PersonalData {
constructor(age, state = '', SSN = '') {
this.age = age;
this.state = state;
this.SSN = SSN;
}
toString() {
return `Social data: Age ${this.age}, ${this.state} ${this.SSN}`;
}
}

let table = grok.data.demo.demog();
table.columns.addNewVirtual('personalData', (i) => new PersonalData(table.row(i).age, 'New York'));
grok.shell.info(table.row(1).personalData.state); // will emit 'New York'
grok.shell.add(table);

The platform uses .toString method to render the virtual column contents in the grid.

As such objects are virtual, a direct modification of their fields won't take any effect on the column's data, unless the object class is implemented in such way that it modifies the original dataframe (for example, in a JavaScript property setter).

Filter

Syncing

Custom value comparers

.tags and .temp collections