# Data science

Datagrok was built for data scientists, by data scientists. Our goal is to let scientists focus on science, not infrastructure.

Out of the box, Datagrok provides the tools needed for data ingestion, transformation, visualization, analysis, and modeling, as well as for deploying models and scientific analyses. Scripts and models can be written in any language, such as R or Python.

## Data munging

By a commonly cited estimate, about 80% of data science time is spent retrieving and cleaning data. By integrating natively with data retrieval mechanisms and providing built-in collaboration features, we drastically reduce that time.

## Reproducibility

- Version control for data
- Version control for code
- Capturing metadata about OS, dependencies, CPU loads, etc.
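
As a rough illustration of the last point, the sketch below records basic environment metadata using only the Python standard library. It is not Datagrok's implementation; the function name `capture_environment` and the shape of the returned dictionary are assumptions made for this example.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Collect basic facts about the machine and the installed packages.

    Hypothetical helper for illustration; not part of any Datagrok API.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only an example package name.
    print(json.dumps(capture_environment(["pip"]), indent=2))
```

Storing a record like this next to each result makes it possible to tell later which environment produced it. CPU load is left out here because the standard library exposes it only on some platforms.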

## Data provenance

Data provenance is the ability to fully understand everything that the result depends upon.

## Data pipelines

Data pipelines are a core component of the Datagrok platform. They let end users define jobs that get data from disparate data sources, clean or merge the data if needed, run transformations, build interactive dashboards based on the retrieved data, and publish these dashboards.

## Statistical hypothesis testing

Available hypothesis tests:

- Welch's t-test: #{x.TTest}
- Kolmogorov–Smirnov test: #{x.KSTest}

Both tests return p-values.

The tests are available on the Context Panel, in the "Commands" section, when two numerical columns without missing values are selected. They are also available from the Functions browser ("Help | Actions"), in the "Math" or "Statistics" sections.
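
To show what the two tests compute, here is a minimal standard-library sketch of the Welch t statistic (with its Welch-Satterthwaite degrees of freedom) and the two-sample Kolmogorov-Smirnov statistic. The function names are made up for this example and do not correspond to the platform's functions. The sketch stops at the test statistics; converting them to p-values requires the corresponding distribution functions, which SciPy provides through `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.ks_2samp(a, b)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def ks_statistic(a, b):
    """Return the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past x so that tied values are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Unlike Student's t-test, Welch's variant does not assume that the two columns have equal variances, which is why the degrees of freedom are estimated from the data.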

## Normalization

Available normalizations:

Any numerical column can be normalized from the Context Panel, in the "Commands" section, or from the Functions browser ("Help | Actions"), in the "Math" section.
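
For orientation, the sketch below shows two widely used normalizations, min-max scaling and z-score standardization. Which methods the platform actually offers is not listed here, so treat both the choice of methods and the function names as assumptions.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column
    return [(v - mu) / sigma for v in values]
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, while z-scores express each value as a distance from the mean in standard deviations.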

## Interactive methods

### Clustering

Performs clustering using the k-means algorithm. An interactive visualization shows the clustering results in real time, which makes the process a lot more intuitive.
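
The following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python, included only to show what the method does. It is not the platform's implementation, and the function name, parameters, and random initialization are assumptions.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of numbers); return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

Because each iteration is cheap, the assignments can be recomputed whenever a parameter such as `k` changes, which is what makes an interactive view of the result practical.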

### Missing Value Imputation

Performs fast and simple imputation of missing values using the k-nearest neighbors algorithm.
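
As an illustration of the idea, the sketch below fills each missing cell with the mean of that column over the `k` most similar complete rows. It is a simplified version written for this page, not the platform's code: the function name is hypothetical, missing values are represented as `None`, and distance is plain Euclidean distance over the columns the incomplete row does have.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows with None cells filled from the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one row without missing values is required")
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]
        # Rank complete rows by distance over the columns this row has.
        nearest = sorted(
            complete,
            key=lambda r: math.sqrt(sum((r[i] - row[i]) ** 2 for i in known)),
        )[:k]
        result.append([
            v if v is not None else sum(r[i] for r in nearest) / len(nearest)
            for i, v in enumerate(row)
        ])
    return result
```

In practice the columns are usually normalized first so that no single column dominates the distance.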

### Random Data Generation

Generates columns of random data drawn from a chosen distribution (Normal, Log-Normal, Binomial, Poisson, or Uniform) with the specified parameters.
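
A minimal standard-library sketch of the same idea follows. The function name `random_column` and the parameter names (`mean`, `sd`, `low`, `high`, `trials`, `p`, `lam`) are assumptions chosen for this example and are not the platform's parameter names.

```python
import math
import random


def random_column(distribution, n, seed=None, **params):
    """Return n random values drawn from the named distribution."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; adequate for small lam.
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k

    draw = {
        "normal": lambda: rng.gauss(params["mean"], params["sd"]),
        # For log-normal, mean and sd describe the underlying normal.
        "log-normal": lambda: rng.lognormvariate(params["mean"], params["sd"]),
        "uniform": lambda: rng.uniform(params["low"], params["high"]),
        # Binomial as the number of successes in `trials` Bernoulli draws.
        "binomial": lambda: sum(
            rng.random() < params["p"] for _ in range(params["trials"])
        ),
        "poisson": lambda: poisson(params["lam"]),
    }[distribution]
    return [draw() for _ in range(n)]
```

For example, `random_column("normal", 100, seed=42, mean=0, sd=1)` returns 100 values, and passing the same seed again reproduces the same column.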

### Multivariate analysis

The Multivariate Analysis plugin implements the partial least squares (PLS) algorithm, an easy-to-interpret, commonly used approach to multidimensional data analysis. It shows the following in interactive viewers: scores, explained variance, correlation loadings, predicted vs. reference, and regression coefficients.
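
To make these quantities concrete, the sketch below extracts a single PLS component for one response variable (PLS1, NIPALS-style) in plain Python. It is a teaching sketch under simplifying assumptions (one component, one response, at least one predictor that covaries with the response), not the plugin's implementation, and the function name and returned keys are made up for this example.

```python
import math


def pls1_one_component(X, y):
    """Fit one PLS component; X is a list of rows, y a list of responses."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    yc = [v - y_mean for v in y]                                # centered y

    # Weights: the unit direction in X with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: projection of every row onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    tt = sum(v * v for v in t)

    # Loadings of X and y on the score vector.
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
    q = sum(yc[i] * t[i] for i in range(n)) / tt

    # With one component and a unit-length w, the coefficients are w * q.
    coefficients = [wj * q for wj in w]
    explained_y = q * q * tt / sum(v * v for v in yc)
    predicted = [
        y_mean + sum(Xc[i][j] * coefficients[j] for j in range(m))
        for i in range(n)
    ]
    return {
        "scores": t,
        "x_loadings": p,
        "coefficients": coefficients,
        "explained_y_variance": explained_y,
        "predicted": predicted,
    }
```

Further components would be obtained by subtracting the fitted part from `Xc` and `yc` (deflation) and repeating the same steps on the residuals.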

### Predictive modeling

Train models, apply them, compare performance characteristics, deploy, share. Currently, there are two ways to train models:

- Using the built-in modeling plugin, which is based on R Caret via OpenCPU. It can train the following models:
  - SVM (linear or radial)
  - Random Forests
  - GBM

- Using H2O. A model can be built in the H2O UI, then exported into the platform in POJO format and used from the "Model browser". The following models are supported:
  - Deep Learning (Neural Networks)
  - Distributed Random Forest (DRF)
  - Generalized Linear Model (GLM)
  - Gradient Boosting Machine (GBM)
  - Naïve Bayes Classifier
  - K-Means Clustering
  - Principal Component Analysis (PCA)

Trained models can be shared with other users. In addition to making them discoverable and reusable, the platform might also suggest applying a model to a dataset (potentially opened by another user) when it deduces that the dataset has the same structure as the one the model was trained on.
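
One simple way to picture such a structure check is to compare column names and types. The sketch below does exactly that; the platform's actual matching rules are not described here, so the functions, the schema representation (a mapping from column name to type name), and the example type names are all assumptions.

```python
def same_structure(training_schema, dataset_schema):
    """True when the dataset has every training column with the same type."""
    return all(
        dataset_schema.get(name) == col_type
        for name, col_type in training_schema.items()
    )


def suggest_models(models, dataset_schema):
    """Return the names of models whose training schema fits the dataset.

    `models` maps a model name to the schema it was trained on.
    """
    return [
        name
        for name, schema in models.items()
        if same_structure(schema, dataset_schema)
    ]
```

With `models = {"m1": {"age": "int", "dose": "double"}, "m2": {"smiles": "string"}}`, a dataset with columns `age` (int), `dose` (double), and `site` (string) would match only `m1`; extra columns in the dataset are ignored.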

See also: