Regular machine
Datagrok consists of Docker containers, a database, and persistent file storage.
As a regular machine, you can use any bare-metal server or virtual machine, including virtual machines from cloud providers, for example, AWS EC2.
As the database, Datagrok supports any PostgreSQL database out of the box, including managed cloud offerings such as AWS RDS.
For persistent file storage, Datagrok supports many options, including local file system storage and cloud solutions such as AWS S3.
This document contains instructions to deploy Datagrok using Docker Compose on AWS EC2 virtual machines, with AWS RDS as the database and the local file system as persistent storage. It does not cover load balancer creation, which is recommended for production usage: one load balancer for the Datagrok components and one for the CVM components. For bare-metal or on-premise deployments, you can use nginx as the load balancer.
More information about Datagrok design and components is available in the Datagrok documentation.
If you want to jump-start Datagrok with minimum manual effort on a local machine, check Local Deployment with Docker Compose.
Prerequisites
- We use native Docker Compose commands to run applications on the machines. This simplifies multi-container application development and deployment.
- Download and install the latest version of Docker Compose on your local machine.
- Additional components (instance, database, storage, etc.) can be created using the AWS CLI. To run the AWS CLI commands provided in this document, install and configure the AWS CLI (a quick check is sketched below).
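
  Before proceeding, you can sanity-check the tooling. A minimal sketch, assuming Docker Compose (v2 plugin or standalone binary) and AWS CLI are installed and AWS credentials are already configured:

  ```bash
  # Verify Docker Compose is available (v2 plugin or standalone binary)
  docker compose version || docker-compose --version

  # Verify the AWS CLI is installed and credentials are configured
  aws --version
  aws sts get-caller-identity --output text --query 'Account'
  ```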
Preparations
The example below contains steps to create EC2 instances as virtual machines with a public IP association. In your case, it can be any virtual machine. Also, load balancers can be used for each VM instead of public IP addresses.
- Generate an SSH key to access the virtual machines:

  ssh-keygen -t rsa -N '' -m PEM -C 'Datagrok SSH Key' -f ~/.ssh/datagrok-deploy.pem

- Import the key pair to AWS. Skip this stage if you do not use AWS EC2.

  aws ec2 import-key-pair --key-name datagrok-deploy --public-key-material fileb://~/.ssh/datagrok-deploy.pem.pub
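
  To confirm the import succeeded, you can list the key pair. A quick check, assuming the key name used above:

  ```bash
  # Should print the key pair name and fingerprint if the import succeeded
  aws ec2 describe-key-pairs --key-names datagrok-deploy \
    --output text --query 'KeyPairs[].[KeyName, KeyFingerprint]'
  ```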
- Create a VPC for the Datagrok EC2 instances. Do the steps in the AWS console or apply the AWS CLI commands below. Skip this stage if you do not use AWS EC2.

  1. Create the VPC:

     aws ec2 create-vpc --cidr-block '10.0.0.0/17' --output text --query Vpc.VpcId

  2. Create a subnet in the VPC:

     aws ec2 create-subnet --vpc-id "<VPC_ID_FROM_1_STEP>" --cidr-block '10.0.0.0/24' --output text --query Subnet.SubnetId

  3. Create an internet gateway and attach it to the VPC:

     aws ec2 create-internet-gateway --output text --query InternetGateway.InternetGatewayId
     aws ec2 attach-internet-gateway --vpc-id "<VPC_ID_FROM_1_STEP>" --internet-gateway-id "<IGW_ID_FROM_1_COMMAND>"

  4. Create a route table with a public route for the subnet in the VPC:

     aws ec2 create-route-table --vpc-id "<VPC_ID_FROM_1_STEP>" --output text --query RouteTable.RouteTableId
     aws ec2 associate-route-table --subnet-id "<SUBNET_ID_FROM_2_STEP>" --route-table-id "<ROUTE_TABLE_ID_FROM_1_COMMAND>"
     aws ec2 create-route --route-table-id "<ROUTE_TABLE_ID_FROM_1_COMMAND>" --destination-cidr-block 0.0.0.0/0 --gateway-id "<IGW_ID_FROM_3_STEP>"
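
  If you script these steps, capturing each ID in a shell variable avoids copy-pasting between commands. A sketch of the same stage, assuming the commands above succeed:

  ```bash
  # Same VPC stage as above, with the returned IDs captured in shell variables
  VPC_ID=$(aws ec2 create-vpc --cidr-block '10.0.0.0/17' --output text --query Vpc.VpcId)
  SUBNET_ID=$(aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block '10.0.0.0/24' --output text --query Subnet.SubnetId)
  IGW_ID=$(aws ec2 create-internet-gateway --output text --query InternetGateway.InternetGatewayId)
  aws ec2 attach-internet-gateway --vpc-id "$VPC_ID" --internet-gateway-id "$IGW_ID"
  RT_ID=$(aws ec2 create-route-table --vpc-id "$VPC_ID" --output text --query RouteTable.RouteTableId)
  aws ec2 associate-route-table --subnet-id "$SUBNET_ID" --route-table-id "$RT_ID"
  aws ec2 create-route --route-table-id "$RT_ID" --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"
  ```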
- Create a security group for the EC2 instances. Skip this stage if you do not use AWS EC2.

  1. Create the security group:

     aws ec2 create-security-group --group-name datagrok-sg --description "Datagrok SG" --vpc-id <VPC_ID_FROM_4_STAGE>

  2. Add a rule for inbound SSH traffic. Note that this rule allows access worldwide (a more restrictive alternative is sketched after this stage):

     aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 22 --cidr 0.0.0.0/0

  3. Add rules for inbound traffic for Datagrok (8080) and CVM (8090, 5005, 54321). Note that these rules allow access worldwide.

     aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 8080 --cidr 0.0.0.0/0
     aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 8090 --cidr 0.0.0.0/0
     aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 5005 --cidr 0.0.0.0/0
     aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> --protocol tcp --port 54321 --cidr 0.0.0.0/0
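
  For anything beyond a short-lived test, you may want to restrict at least SSH to your own address instead of 0.0.0.0/0. A hedged alternative to the SSH rule above, assuming your machine has outbound HTTPS access:

  ```bash
  # Allow SSH only from your current public IP instead of the whole internet
  MY_IP=$(curl -s https://checkip.amazonaws.com)
  aws ec2 authorize-security-group-ingress --group-id <SG_ID_FROM_1_STEP> \
    --protocol tcp --port 22 --cidr "${MY_IP}/32"
  ```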
- Create a virtual machine for the Datagrok components. Requirements: 2 vCPU and 4 GB RAM.

  Create an EC2 instance for the Datagrok VM:

  - Choose an AMI with any OS you prefer
  - Press Next
  - Choose an Instance Type: t3.medium
  - Press Next
  - For Network, choose the VPC created in the 4th stage
  - For Subnet, choose the subnet created in the 4th stage
  - Auto-assign Public IP: Enable
  - Press Next
  - Set Size for Storage to 20 GiB
  - Press Next
  - The default Tags are fine. Press Next
  - Select an existing security group and check the security group created in the 5th stage: datagrok-sg
  - Review and Launch
  - Launch
  - Choose the existing key pair imported in the 3rd stage: datagrok-deploy

  Or do it from the AWS CLI (adjust DeviceName to your AMI's root device name if it differs):

  aws ec2 run-instances --image-id ami-092cce4a19b438926 --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=20}' --associate-public-ip-address --count 1 --instance-type t3.medium --key-name datagrok-deploy --security-group-ids <SG_ID_FROM_5_STAGE> --subnet-id <SUBNET_ID_FROM_4_STAGE>
- Create a virtual machine for the CVM components. Requirements: 4 vCPU and 8 GB RAM.

  Create an EC2 instance for the Compute VM:

  - Choose an AMI with any Linux OS you prefer
  - Press Next
  - Choose an Instance Type: c5.xlarge
  - Press Next
  - For Network, choose the VPC created in the 4th stage
  - For Subnet, choose the subnet created in the 4th stage
  - Auto-assign Public IP: Enable
  - Press Next
  - Set Size for Storage to 100 GiB
  - Press Next
  - The default Tags are fine. Press Next
  - Select an existing security group and check the security group created in the 5th stage: datagrok-sg
  - Review and Launch
  - Launch
  - Choose the existing key pair imported in the 3rd stage: datagrok-deploy

  Or do it from the AWS CLI (again, adjust DeviceName if your AMI's root device name differs):

  aws ec2 run-instances --image-id ami-092cce4a19b438926 --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100}' --associate-public-ip-address --count 1 --instance-type c5.xlarge --key-name datagrok-deploy --security-group-ids <SG_ID_FROM_5_STAGE> --subnet-id <SUBNET_ID_FROM_4_STAGE>
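
  After the instances are running, you need their public IP addresses for the later steps. One way to retrieve them, assuming you noted the instance IDs that run-instances printed (the <INSTANCE_ID> placeholder is yours to fill in):

  ```bash
  # Print the public IPv4 address of a launched instance
  aws ec2 describe-instances --instance-ids <INSTANCE_ID> \
    --output text --query 'Reservations[].Instances[].PublicIpAddress'
  ```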
- Configure the virtual machines:

  1. Log in to the machines. For EC2 instances, use the private key created in the first stage.

  2. Install Docker on the virtual machines (see the sketch below).

  3. Add the login user to the docker group on the virtual machines:

     sudo usermod -a -G docker <login_user>
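
  The document does not prescribe a Docker installation method. A common approach, shown here as a sketch using Docker's convenience script, assuming a supported Linux distribution with internet access:

  ```bash
  # Install Docker Engine using the official convenience script
  curl -fsSL https://get.docker.com -o get-docker.sh
  sudo sh get-docker.sh

  # Verify the installation
  docker --version
  ```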
- Create a PostgreSQL 12 database for Datagrok.

  1. Create a security group for the RDS instance. Skip this step if you do not use AWS RDS. You can do it from the AWS CLI (an inbound rule for PostgreSQL is sketched after this stage):

     aws ec2 create-security-group --group-name datagrok-rds-sg --description "Datagrok RDS SG" --vpc-id <VPC_ID_FROM_4_STAGE>

  2. Create an RDS instance for Datagrok. Skip this step if you do not use AWS RDS.

     - DB instance identifier: datagrok-rds
     - Template: Dev/Test
     - Master username: postgres
     - Master password and Confirm password: postgres
     - DB instance class: Burstable classes: db.t3.medium
     - Allocated storage: 50 GiB
     - Enable storage autoscaling
     - Maximum storage threshold: 100 GiB
     - Do not create a standby instance
     - Virtual private cloud (VPC): choose the VPC created in the 4th stage
     - Subnet group: create a new DB Subnet Group
     - Public access: No
     - VPC security group: Choose existing: select the security group created in the 1st step: datagrok-rds-sg

     You can do it from the AWS CLI:

     aws rds create-db-subnet-group \
       --db-subnet-group-name "datagrok-rds" \
       --db-subnet-group-description "DB subnet group for datagrok-rds" \
       --subnet-ids "<SUBNET_ID_FROM_4_STAGE>"

     Note: AWS requires a DB subnet group to cover at least two Availability Zones, so you may need to create a second subnet in a different AZ and include it here.

     aws rds create-db-instance \
       --db-instance-identifier "datagrok-rds" \
       --db-name "datagrok" \
       --engine 'postgres' \
       --engine-version '12.9' \
       --auto-minor-version-upgrade \
       --allocated-storage 50 \
       --max-allocated-storage 100 \
       --db-instance-class 'db.t3.medium' \
       --master-username "postgres" \
       --master-user-password "postgres" \
       --port "5432" \
       --no-publicly-accessible \
       --storage-encrypted \
       --deletion-protection \
       --backup-retention-period 3 \
       --output text --query 'DBInstance.[DBInstanceIdentifier, DBInstanceStatus]'

  3. Copy the database address (RDS endpoint):

     aws rds describe-db-instances --db-instance-identifier "datagrok-rds" --output text --query 'DBInstances[].[DBInstanceStatus, Endpoint.Address]'
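
  The steps above create datagrok-rds-sg but do not add any inbound rules, so the Datagrok VM cannot reach the database yet. A sketch of one way to allow PostgreSQL traffic from the instances' security group (the <RDS_SG_ID_FROM_1_STEP> placeholder is the group ID returned in step 1):

  ```bash
  # Allow inbound PostgreSQL (5432) from the EC2 instances' security group
  aws ec2 authorize-security-group-ingress \
    --group-id <RDS_SG_ID_FROM_1_STEP> \
    --protocol tcp --port 5432 \
    --source-group <SG_ID_FROM_5_STAGE>
  ```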
- Locally, create a Docker context for each virtual machine:

  docker context create --docker 'host=ssh://<DATAGROK_VM_IP_ADDRESS>:22' datagrok
  docker context create --docker 'host=ssh://<CVM_VM_IP_ADDRESS>:22' cvm
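
  Docker contexts over SSH rely on your regular SSH configuration, so the login user and key must resolve without prompts. A possible ~/.ssh/config entry, assuming an Ubuntu AMI whose default login user is ubuntu (adjust for your AMI):

  ```bash
  # Append an SSH host entry so the contexts can connect non-interactively.
  # 'ubuntu' is an assumption: use the default login user of your AMI.
  cat >> ~/.ssh/config <<'EOF'
  Host <DATAGROK_VM_IP_ADDRESS> <CVM_VM_IP_ADDRESS>
    User ubuntu
    IdentityFile ~/.ssh/datagrok-deploy.pem
  EOF
  ```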
- Download the Docker Compose YAML file: link.
Setup Datagrok components
- Switch to the datagrok context:

  docker context use datagrok

- In the downloaded localhost.docker-compose.yaml, replace the GROK_PARAMETERS value with:

  {
    "dbServer": "<DATABASE_SERVER>",
    "dbPort": "5432",
    "db": "datagrok",
    "dbLogin": "datagrok",
    "dbPassword": "SoMeVeRyCoMpLeXpAsSwOrD",
    "dbAdminLogin": "postgres",
    "dbAdminPassword": "postgres"
  }

  Use the RDS endpoint copied in the previous stage as <DATABASE_SERVER>.
- Run the Datagrok deployment. Wait for the deployment process to complete:

  COMPOSE_PROFILES=datagrok docker-compose --project-name datagrok up -d

  NOTE: Datagrok provides demo databases with demo data for the full experience. If you want to try Datagrok with demo data, run the following command instead:

  COMPOSE_PROFILES=datagrok,demo docker-compose --project-name datagrok up -d
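
  To see whether the containers came up before moving on, you can inspect the compose project through the active context. A quick status check, assuming the project name used above:

  ```bash
  # List service states for the datagrok project (run while the datagrok context is active)
  docker-compose --project-name datagrok ps

  # Tail the logs if a service is restarting or unhealthy
  docker-compose --project-name datagrok logs --tail 100
  ```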
- Check if Datagrok started successfully: open http://<DATAGROK_VM_IP_ADDRESS>:8080 and log in to Datagrok using the username "admin" and the password "admin".

- Switch back to the default Docker context:

  docker context use default
Setup CVM components
- Switch to the cvm context:

  docker context use cvm

- Run the CVM deployment. Wait for the deployment process to complete:

  COMPOSE_PROFILES=cvm docker-compose --project-name cvm up -d
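
  The same kind of status check applies to the CVM project, for example:

  ```bash
  # List service states for the cvm project (run while the cvm context is active)
  docker-compose --project-name cvm ps
  ```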
- Edit settings in the running Datagrok platform (Tools -> Settings...). Do not forget to click Apply to save the new settings.

  - Scripting:
    - CVM Url: http://<CVM_VM_IP_ADDRESS>:8090
    - CVM URL Client: http://<CVM_VM_IP_ADDRESS>:8090
    - H2o Url: http://<CVM_VM_IP_ADDRESS>:54321
    - API Url: http://<DATAGROK_VM_IP_ADDRESS>:8080/api
    - Cvm Split: true
  - Dev:
    - CVM Url: http://<CVM_VM_IP_ADDRESS>:8090
    - Cvm Split: true
    - API Url: http://<DATAGROK_VM_IP_ADDRESS>:8080/api
- Switch back to the default Docker context:

  docker context use default
User access
Both the Compute and Datagrok engines should be accessible to users. The easiest way is to create DNS endpoints pointing to the public IPs or to load balancers in front of the services: datagrok.example and cvm.example.
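
For example, if the zone is hosted in Amazon Route 53, a record for each endpoint can be created from the CLI. A sketch, where <HOSTED_ZONE_ID> and the IP placeholder are assumptions from your own setup:

```bash
# Create or update an A record pointing datagrok.example at the Datagrok VM.
# Repeat with "cvm.example" and <CVM_VM_IP_ADDRESS> for the CVM endpoint.
aws route53 change-resource-record-sets --hosted-zone-id <HOSTED_ZONE_ID> --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "datagrok.example",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{"Value": "<DATAGROK_VM_IP_ADDRESS>"}]
    }
  }]
}'
```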