Regular machine

Datagrok consists of Docker containers, a database, and persistent file storage.

As a regular machine, you can use any bare-metal server or virtual machine, including virtual machines in cloud providers, for example, AWS EC2.

As the database, Datagrok supports any PostgreSQL database out of the box, including managed cloud offerings such as AWS RDS.

For persistent file storage, Datagrok supports many options, including local file system storage and cloud solutions such as AWS S3.

This document contains instructions for deploying Datagrok with Docker Compose on AWS EC2 virtual machines, using AWS RDS as the database and the local file system for persistent storage. It does not cover load balancer creation, which is recommended for production use: one load balancer for the Datagrok components and one for the CVM components. In bare-metal or on-premise cases, you can use nginx as the load balancer.

For more information, see the documentation on Datagrok design and components.

If you want to jump-start Datagrok with minimal manual effort on a local machine, check Local Deployment with Docker Compose.

Prerequisites

  1. We use native Docker Compose commands to run applications on the machines. Compose simplifies multi-container application development and deployment.
    1. Download and install the latest version of Docker Compose on your local machine
  2. Additional components (instance, database, storage, etc.) can be created using AWS CLI. To run the AWS CLI commands provided in this document:
    1. Install AWS CLI
    2. Configure authorization for AWS CLI
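
    A minimal sketch of configuring AWS CLI credentials interactively, assuming you have an access key with permissions to manage EC2 and RDS resources:

      aws configure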

Preparations

The example below contains steps to create EC2 instances with associated public IPs as the virtual machines. In your case, it can be any virtual machine. You can also put a load balancer in front of each VM instead of using public IP addresses.

  1. Generate SSH key to access virtual machines

    ssh-keygen -t rsa -N '' -m PEM -C 'Datagrok SSH Key' -f ~/.ssh/datagrok-deploy.pem
  2. Import the key pair to AWS. Skip this stage if you do not use AWS EC2.

    aws ec2 import-key-pair --key-name datagrok-deploy --public-key-material fileb://~/.ssh/datagrok-deploy.pem.pub
  3. Create a VPC for the Datagrok EC2 instances. Follow the linked steps or run the AWS CLI commands below. Skip this stage if you do not use AWS EC2.

    1. Create VPC

      aws ec2 create-vpc --cidr-block '10.0.0.0/17' --output text --query Vpc.VpcId
    2. Create Subnet in VPC

      aws ec2 create-subnet --vpc-id "<VPC_ID_FROM_1_STEP>" --cidr-block '10.0.0.0/24' --output text --query Subnet.SubnetId
    3. Create an internet gateway and attach it to the VPC

      aws ec2 create-internet-gateway --output text --query InternetGateway.InternetGatewayId
      aws ec2 attach-internet-gateway --vpc-id "<VPC_ID_FROM_1_STEP>" --internet-gateway-id "<IGW_ID_FROM_1_COMMAND>"
    4. Create route table with a public route for Subnet in VPC

      aws ec2 create-route-table --vpc-id "<VPC_ID_FROM_1_STEP>" --output text --query RouteTable.RouteTableId
      aws ec2 associate-route-table --subnet-id "<SUBNET_ID_FROM_2_STEP>" --route-table-id "<ROUTE_TABLE_ID_FROM_1_COMMAND>"
      aws ec2 create-route --route-table-id "<ROUTE_TABLE_ID_FROM_1_COMMAND>" --destination-cidr-block 0.0.0.0/0 --gateway-id "<IGW_ID_FROM_3_STEP>"
  4. Create Security Group for EC2 instances. Skip this stage if you do not use AWS EC2.

    1. Create Security Group

      aws ec2 create-security-group --group-name datagrok-sg --description "Datagrok SG" --vpc-id <VPC_ID_FROM_3_STAGE>
    2. Add a rule for inbound SSH traffic. Note that this rule allows access worldwide.

      aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 22 --cidr 0.0.0.0/0
    3. Add a rule for inbound traffic for Datagrok (8080) and CVM (8090, 5005, 54321). Note that these rules allow access worldwide.

      aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 8080  --cidr 0.0.0.0/0
      aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 8090 --cidr 0.0.0.0/0
      aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 5005 --cidr 0.0.0.0/0
      aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 54321 --cidr 0.0.0.0/0
  5. Create a virtual machine for Datagrok components. Requirements: 2 vCPU and 4 GB RAM.

    1. Create EC2 instance for Datagrok VM.

      1. Choose an AMI with any Linux OS you prefer
      2. Press Next
      3. Choose an Instance Type t3.medium
      4. Press Next
      5. For Network, choose the VPC created in the 3rd stage
      6. For Subnet, choose the subnet created in the 3rd stage
      7. Auto-assign Public IP: Enable
      8. Press Next
      9. Set Size for Storage to 20 GiB
      10. Press Next
      11. We are okay with default Tags. Press Next
      12. Select an existing security group and check the security group created in the 4th stage: datagrok-sg
      13. Review and Launch
      14. Launch
      15. Choose the existing key pair imported in the 2nd stage: datagrok-deploy

      Or do it from AWS CLI:

      aws ec2 run-instances --image-id ami-092cce4a19b438926 --block-device-mappings 'Ebs={VolumeSize=20}' --network-interfaces 'AssociatePublicIpAddress=true' --count 1 --instance-type t3.medium --key-name datagrok-deploy --security-group-ids <SG_ID_FROM_4_STAGE> --subnet-id <SUBNET_ID_FROM_3_STAGE>
  6. Create a virtual machine for CVM components. Requirements: 4 vCPU and 8 GB RAM.

    1. Create EC2 instance for Compute VM.

      1. Choose an AMI with any Linux OS you prefer
      2. Press Next
      3. Choose an Instance Type c5.xlarge
      4. Press Next
      5. For Network, choose the VPC created in the 3rd stage
      6. For Subnet, choose the subnet created in the 3rd stage
      7. Auto-assign Public IP: Enable
      8. Press Next
      9. Set Size for Storage to 100 GiB
      10. Press Next
      11. We are okay with default Tags. Press Next
      12. Select an existing security group and check the security group created in the 4th stage: datagrok-sg
      13. Review and Launch
      14. Launch
      15. Choose the existing key pair imported in the 2nd stage: datagrok-deploy

      Or do it from AWS CLI:

      aws ec2 run-instances --image-id ami-092cce4a19b438926 --block-device-mappings 'Ebs={VolumeSize=100}' --network-interfaces 'AssociatePublicIpAddress=true' --count 1 --instance-type c5.xlarge --key-name datagrok-deploy --security-group-ids <SG_ID_FROM_4_STAGE> --subnet-id <SUBNET_ID_FROM_3_STAGE>
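
      Later steps reference the instances' public IP addresses as <DATAGROK_VM_IP_ADDRESS> and <CVM_VM_IP_ADDRESS>. A minimal sketch of retrieving them with AWS CLI, assuming you noted the instance IDs returned by run-instances:

      aws ec2 describe-instances --instance-ids <INSTANCE_ID> --output text --query 'Reservations[].Instances[].PublicIpAddress'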
  7. Configure virtual machines

    1. Log in to machines

      1. Use the private key created in the first stage for EC2 instances
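
        For EC2, the SSH login user depends on the AMI (for example, ec2-user for Amazon Linux or ubuntu for Ubuntu). A minimal sketch of logging in to the Datagrok VM with the key generated in the 1st stage:

        ssh -i ~/.ssh/datagrok-deploy.pem <login_user>@<DATAGROK_VM_IP_ADDRESS>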
    2. Install Docker on virtual machines
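
      A minimal sketch using Docker's convenience script (assumes the VM has internet access; alternatively, install Docker Engine from your distribution's package repository):

      curl -fsSL https://get.docker.com -o get-docker.sh
      sudo sh get-docker.sh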

    3. Add login user to docker group on virtual machines

      sudo usermod -a -G docker <login_user>
  8. Create PostgreSQL 12 database for Datagrok

    1. Create a Security Group for the RDS instance. Skip this step if you do not use AWS RDS. You can do it from AWS CLI:

      aws ec2 create-security-group --group-name datagrok-rds-sg --description "Datagrok RDS SG" --vpc-id <VPC_ID_FROM_3_STAGE>
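
      To let the Datagrok VM reach the database, add an inbound PostgreSQL rule to this security group. A minimal sketch, allowing traffic from the EC2 security group created in the 4th stage:

      aws ec2 authorize-security-group-ingress --group-id <RDS_SG_ID_FROM_1_STEP> --protocol tcp --port 5432 --source-group <SG_ID_FROM_4_STAGE>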
    2. Create RDS instance for Datagrok. Skip this step if you do not use AWS RDS.

      1. DB instance identifier: datagrok-rds

      2. Dev/Test Template

      3. Master username: postgres

      4. Master password and Confirm password: postgres

      5. DB instance class: Burstable classes: db.t3.medium

      6. Allocated storage: 50

      7. Enable storage autoscaling

      8. Maximum storage threshold: 100

      9. Do not create a standby instance

      10. For Virtual private cloud (VPC), choose the VPC created in the 3rd stage

      11. Create a new DB Subnet Group for the Subnet group

      12. Public access: No

      13. VPC security group: Choose existing: select the security group created in the 1st step: datagrok-rds-sg

        You can do it from AWS CLI:

        aws rds create-db-subnet-group \
        --db-subnet-group-name "datagrok-rds" \
        --db-subnet-group-description "DB subnet group for datagrok-rds" \
        --subnet-ids '<SUBNET_ID_FROM_3_STAGE>'
        aws rds create-db-instance \
        --db-instance-identifier "datagrok-rds" \
        --db-name "datagrok" \
        --engine 'postgres' \
        --engine-version '12.9' \
        --auto-minor-version-upgrade \
        --allocated-storage 50 \
        --max-allocated-storage 100 \
        --db-instance-class 'db.t3.medium' \
        --master-username "postgres" \
        --master-user-password "postgres" \
        --port "5432" \
        --no-publicly-accessible \
        --storage-encrypted \
        --deletion-protection \
        --backup-retention-period 3 \
        --output text --query 'DBInstance.[DBInstanceIdentifier, DBInstanceStatus]'
    3. Copy Database address

      1. Copy RDS endpoint

        aws rds describe-db-instances --db-instance-identifier "datagrok-rds" --output text --query 'DBInstances[].[DBInstanceStatus, Endpoint.Address]'
  9. Locally, create a Docker context for each virtual machine:

    docker context create --docker 'host=ssh://<DATAGROK_VM_IP_ADDRESS>:22' datagrok
    docker context create --docker 'host=ssh://<CVM_VM_IP_ADDRESS>:22' cvm
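
    Docker contexts over SSH use your local SSH configuration, so the key from the 1st stage must be available to the SSH client (and the remote login user may need to be specified, for example in ~/.ssh/config). A minimal sketch using ssh-agent:

    eval "$(ssh-agent)"
    ssh-add ~/.ssh/datagrok-deploy.pem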
  10. Download the Docker Compose YAML file: link.

Set up Datagrok components

  1. Switch to the datagrok context:

    docker context use datagrok

  2. In the downloaded localhost.docker-compose.yaml, replace the GROK_PARAMETERS value with the following, setting dbServer to the RDS endpoint copied earlier:

    {
      "dbServer": "<DATABASE_SERVER>",
      "dbPort": "5432",
      "db": "datagrok",
      "dbLogin": "datagrok",
      "dbPassword": "SoMeVeRyCoMpLeXpAsSwOrD",
      "dbAdminLogin": "postgres",
      "dbAdminPassword": "postgres"
    }
  3. Run the Datagrok deployment. Wait for the deployment process to complete.

    COMPOSE_PROFILES=datagrok docker-compose --project-name datagrok up -d

    NOTE: Datagrok provides demo databases with demo data for the full experience. If you want to try Datagrok with demo data, run the following command instead.

    COMPOSE_PROFILES=datagrok,demo docker-compose --project-name datagrok up -d
  4. Check that Datagrok started successfully: open http://<DATAGROK_VM_IP_ADDRESS>:8080 and log in to Datagrok with the username "admin" and password "admin".
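
    If the page does not respond, a quick way to investigate is to check the container status and logs while the datagrok context is active; a minimal sketch (the container name is a placeholder, take it from the docker ps output):

    docker ps
    docker logs --tail 100 <datagrok_container_name>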

  5. Switch back to default docker context:

    docker context use default

Set up CVM components

  1. Switch to the cvm context:

    docker context use cvm

  2. Run the CVM deployment. Wait for the deployment process to complete.

    COMPOSE_PROFILES=cvm docker-compose --project-name cvm up -d
  3. Edit settings in the running Datagrok platform (Tools -> Settings...). Do not forget to click Apply to save the new settings.

    • Scripting:
      • CVM Url: http://<CVM_VM_IP_ADDRESS>:8090
      • CVM URL Client: http://<CVM_VM_IP_ADDRESS>:8090
      • H2o Url: http://<CVM_VM_IP_ADDRESS>:54321
      • API Url: http://<DATAGROK_VM_IP_ADDRESS>:8080/api
      • Cvm Split: true
    • Dev:
      • CVM Url: http://<CVM_VM_IP_ADDRESS>:8090
      • Cvm Split: true
      • API Url: http://<DATAGROK_VM_IP_ADDRESS>:8080/api
  4. Switch back to default docker context:

    docker context use default

User access

Both the Datagrok and Compute (CVM) components should be accessible to users. The easiest way is to create DNS endpoints pointing to the public IPs or to load balancers in front of the services: datagrok.example and cvm.example.
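
A minimal sketch of creating such a DNS record with AWS CLI, assuming the zone is hosted in Route 53 (<HOSTED_ZONE_ID> is a placeholder; repeat with cvm.example and <CVM_VM_IP_ADDRESS> for the CVM endpoint):

    aws route53 change-resource-record-sets --hosted-zone-id <HOSTED_ZONE_ID> --change-batch '{"Changes": [{"Action": "UPSERT", "ResourceRecordSet": {"Name": "datagrok.example", "Type": "A", "TTL": 300, "ResourceRecords": [{"Value": "<DATAGROK_VM_IP_ADDRESS>"}]}}]}'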