Exploratory data analysis
Before we can learn from data, we need to understand it. Exploratory data analysis (EDA) is the process of performing an initial investigation of data to discover patterns, spot anomalies, test hypotheses, and check assumptions.
By its nature, EDA is visually driven. Most of today's datasets are too big, too complex, and too diverse to be explored in a tabular format or by statistical means alone. Humans, on the other hand, evolved to understand complex information visually and are better than computers at detecting patterns and anomalies.
Interactivity is key. We may not know what we are looking for until we extract knowledge from data and update our understanding as we go. To uncover insights that otherwise may go unnoticed, we need to be able to quickly change both what we are viewing and how we are viewing it:
- Look at data from multiple perspectives at once
- Zoom in and filter
- Manipulate, edit, and add data
- Get details on demand
- Select rows of interest and see how they compare to other row sets
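The last item in the list above, comparing a selection against the remaining rows, can be sketched in plain Python. This is only an illustration of the idea, not Datagrok's API: the sample rows, the column names, and the `compare_selection` helper are all hypothetical.

```python
from statistics import mean

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"region": "north", "sales": 120.0},
    {"region": "north", "sales": 80.0},
    {"region": "south", "sales": 200.0},
    {"region": "south", "sales": 160.0},
    {"region": "west", "sales": 100.0},
]


def compare_selection(rows, is_selected, column):
    """Return count and mean of `column` for selected rows and for the rest."""
    selected = [r[column] for r in rows if is_selected(r)]
    rest = [r[column] for r in rows if not is_selected(r)]
    return {
        "selected_count": len(selected),
        "selected_mean": mean(selected) if selected else None,
        "rest_count": len(rest),
        "rest_mean": mean(rest) if rest else None,
    }


# Select the "south" rows and compare their sales with all other rows.
summary = compare_selection(rows, lambda r: r["region"] == "south", "sales")
print(summary)
```

In an interactive tool the predicate would come from a mouse selection or a filter rather than a lambda, but the comparison itself is the same split into two row sets.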
We designed Datagrok from the ground up for visually driven EDA of big, complex datasets. Unlike tools built on a conventional client-server architecture, Datagrok uses a proprietary in-memory database that makes it possible to analyze millions of columns and billions of rows at the speed of thought, right in your browser.
With Datagrok, you can:
Seamlessly load data from any data source. Datagrok supports all popular databases and multiple file formats, and is both data-agnostic and domain-intelligent.
Visualize the data using domain-specific value renderers (such as molecules on scatter plot axes).
Analyze big datasets that other tools struggle with (billions of rows, or millions of columns).
Use multiple interactive tools to wrangle data right from your visualization workspace. Cluster data, impute missing values, find and treat duplicates and outliers.
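To make the wrangling steps named above more concrete, here is a minimal sketch of three of them using only the Python standard library. It does not show how Datagrok implements these tools. Median imputation and the 1.5 × IQR rule are assumptions chosen for illustration, and all function names are hypothetical.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print(impute_median([1.0, None, 3.0]))
print(drop_duplicates([("a", 1), ("b", 2), ("a", 1)]))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 100]))
```

Which imputation strategy or outlier rule is appropriate depends on the data. The point of doing this interactively is that the effect of each choice is visible in the viewers right away.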
Visualize data at the click of a button using 30+ native viewers. We support all popular visualizations (like scatter plots with built-in regression lines or box plots with built-in statistical tests) and certain domain-specific viewers, such as chemically-aware viewers. The tabular viewer, grid, is extremely powerful. Some of its features include:
- Viewing datasets with millions of columns and billions of rows
- Dataset overview, including summary statistics for numerical data columns and distribution for categorical data columns
- Custom cell renderers for molecules, sequences, dose-response curves, and sparklines
- Editing datasets (for example, adding new molecules using sketchers)
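The dataset overview mentioned in the list above (summary statistics for numerical columns, a distribution for categorical ones) can be approximated with a short sketch. This is an illustrative stand-in, not the grid's actual implementation. The `summarize_column` helper and its output keys are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def summarize_column(values):
    """Summarize one column: statistics if numerical, value counts otherwise."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if present and all(_is_number(v) for v in present):
        return {
            "kind": "numerical",
            "count": len(present),
            "missing": missing,
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            # The sample standard deviation needs at least two values.
            "stdev": stdev(present) if len(present) > 1 else 0.0,
        }
    return {
        "kind": "categorical",
        "count": len(present),
        "missing": missing,
        "distribution": dict(Counter(present)),
    }


print(summarize_column([1, 2, 3, None]))
print(summarize_column(["a", "b", "a"]))
```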
Filter, zoom in and out, aggregate, pivot, and cross-link data. All our viewers work in tandem and are customizable, performant, and interactive.
Perform calculations on data using predefined statistical functions.
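As one example of such a calculation, the regression line that a scatter plot can overlay is an ordinary least-squares fit. The sketch below is a generic textbook implementation, not Datagrok's own function. The name `least_squares_line` and the sample values are hypothetical.

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on the line y = 2x + 1.
slope, intercept = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```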
Get details on demand using a variety of widgets, including customizable tooltips and context-driven info panels.
Build on the collective knowledge of Datagrok users. Using built-in data augmentation capabilities, Datagrok understands the nature of your data and offers actionable insights based on it. For example, the platform automatically suggests visualizations for datasets or predicts properties for chemical structures.
You can also leverage Datagrok's component-based architecture to extend or create any element you like. For example, you can add custom viewers or develop new functions in R, Python, or Julia.
Each of these actions can be automated and used in pipelines. Sharing the results of your analysis is easy and secure.
With Datagrok, anyone can use their domain knowledge and perceptive abilities to explore data and uncover its meaning.
Examples
Interactive Data Visualization
An overview of some of the visualization capabilities of the Datagrok platform, including the concepts of views, viewers, selection, filter, and layouts.
Coffee Company
How do we choose the best location for a new coffee place, given the historical sales data? Datagrok to the rescue! In less than 20 minutes, we achieve the following:
• Retrieve historical data from the Postgres database
• Explore, visualize, and clean the dataset
• Impute missing values
• Extract census data from the longitude/latitude coordinates
• Perform multivariate analysis
• Build multiple predictive models, and assess their performance
• Build an interactive map for predicting sales
• Deploy the results as an app to all users in our company