Predictive modeling

Predictive modeling uses statistics to predict outcomes.

Algorithms

Predictive models can be used either directly to estimate a response (outcome) given a defined set of characteristics (features), or indirectly to drive the choice of decision rules.

The plugin supports two kernels (modeling engines): R Caret (through OpenCPU) and H2O.

To select a kernel, open Tools | Settings... | Servers.

Enable "Use Open Cpu For Predictive Modeling" to use R Caret; otherwise, H2O is used.

Available models

For Caret:

Method	Model
rf	Random Forest
gbm	Stochastic Gradient Boosting Machine
svmLinear	Support Vector Machines with Linear Kernel
svmRadial	Support Vector Machines with Radial Basis Function Kernel

For H2O:

Method	Model
Auto ML	Automatic model builder that chooses the best model among GLM, DRF, GBM, and Deep Learning
Deep Learning	Deep Learning (Neural Networks)
Distributed Random Forest	DRF
Generalized Linear Model	GLM
Gradient Boosting Machine	GBM
Naive Bayes Classifier
K-Means Clustering
Principal Component Analysis	PCA

Note: "K-Means Clustering" and "Principal Component Analysis" do not require a prediction column, since they only produce output.

Train model

Example for R Caret engine:

Open table
Run from menu: Tools | Predictive modeling | Train
Set model name
Select table that contains features
Select feature columns
Select outcome column
Set checkbox to impute missing values, if required
Set number of nearest neighbors to predict missing values, if required
Select modeling method. See Available models for descriptions
Set the percentage of table rows to use for training, from 0.1 to 100
Run model training
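
The train percentage step above can be pictured as a random split of the table rows. The following is a minimal sketch, not the platform's implementation: the function name `split_rows`, the `seed` parameter, and the rounding rule are assumptions made for illustration.

```python
import random


def split_rows(rows, train_percent, seed=0):
    """Split rows into (train, holdout) using a train percentage in [0.1, 100]."""
    if not 0.1 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0.1 and 100")
    indices = list(range(len(rows)))
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(indices)
    # Keep at least one row for training, even for very small percentages.
    n_train = max(1, round(len(rows) * train_percent / 100))
    train = [rows[i] for i in indices[:n_train]]
    holdout = [rows[i] for i in indices[n_train:]]
    return train, holdout


rows = [{"id": i} for i in range(10)]
train, holdout = split_rows(rows, train_percent=80)
print(len(train), len(holdout))  # 8 2
```

Rows that are not used for training can then serve as a holdout set for evaluating the model.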

Apply model

Open table
Run from menu: Tools | Predictive modeling | Apply
Select table that contains features
Select applicable model
Set checkbox to impute missing values, if required
Set number of nearest neighbors to predict missing values, if required
Apply model

Models can also be applied through the "Models Browser" (Tools | Predictive modeling | Browse Models), or from the suggested models shown in the table properties in the Toolbox or Context Panel.
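
The imputation options in the train and apply steps refer to filling missing values from the nearest neighbors of a row. The sketch below shows one common form of k-nearest-neighbor imputation in plain Python. It is an illustration only: the function `knn_impute`, the distance measure, and the use of the neighbor mean are assumptions, and the engines may implement imputation differently.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over the columns that
    both rows have observed.
    """
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                # A neighbor must be a different row with this column observed.
                if j == i or other[col] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                completed[i][col] = sum(nearest) / len(nearest)
    return completed


table = [
    [1.0, 2.0],
    [1.1, None],   # missing value to impute
    [5.0, 10.0],
    [1.2, 2.4],
]
print(knn_impute(table, k=2))
```

With `k=2`, the missing value is filled from the two rows whose first column is closest to 1.1, so the distant row `[5.0, 10.0]` does not influence the result.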

Outputs

Outcome

The result of modeling (train or apply) is appended to the source table as a column named "Outcome".

ROC curve

Receiver operating characteristic curve, i.e. ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out.
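
To make the construction concrete, the sketch below computes the (FPR, TPR) points of a ROC curve from binary labels and classifier scores. The function name `roc_points` and the sample data are hypothetical; a row is counted as predicted positive when its score is at or above the threshold.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]                 # 1 = positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # hypothetical classifier scores
for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always ends at (1, 1), where the threshold is low enough that every row is predicted positive.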

ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
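
One common way to compare models without fixing a cost context is the area under the ROC curve (AUC). AUC is not named in the text above, so treat the following as an illustrative assumption rather than what the platform reports. The sketch uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive row gets the higher score, with ties counted as one half. The function name `auc` and the scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical scores
model_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]   # hypothetical scores
print(f"{auc(labels, model_a):.3f}")  # 0.889
print(f"{auc(labels, model_b):.3f}")  # 0.333
```

A value near 1 means the model ranks positives above negatives almost always, while 0.5 corresponds to random ranking.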

Confusion matrix

Confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).
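
The layout described above can be sketched in a few lines. In this example rows are actual classes and columns are predicted classes; the function name `confusion_matrix` and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (classes, matrix) with rows = actual class, columns = predicted class."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix


actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "bird"]
classes, matrix = confusion_matrix(actual, predicted)
print(classes)  # ['bird', 'cat', 'dog']
for name, row in zip(classes, matrix):
    print(name, row)
```

Off-diagonal cells show the confusions: here one "cat" was predicted as "dog" and one "dog" was predicted as "cat".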

Deployment

By itself, building a good model typically does not have a lot of value, but sharing the gained knowledge does. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the data and on the requirements, the results could be presented as a data table, a report, an interactive visualization, or something else.

The Datagrok platform was designed with that in mind. In addition to traditional model deployment techniques such as tables and reports, Datagrok offers a unique way of distributing predictive model results via data augmentation and info panels.
