Setup Gathr on AWS - Manual Deployment

This topic explains how to install Gathr in an AWS environment.

AWS Portal Access

To set up Gathr, the user should have sufficient privileges to create and manage resources in AWS (EC2 instances, VPC, subnets, security groups, IAM roles and instance profiles, S3 buckets, and the Gathr Webstudio instance).

Note: Hadoop components are optional. Depending on the requirement, you may need to install the Hadoop services (HDFS, Hive, YARN, etc.) manually and update the configurations in Gathr accordingly.

Hardware and Software Configurations

The following table lists the system requirements for the Gathr application:

| Hardware/Software | Requirement |
| --- | --- |
| Machine Type | m5.2xlarge or bigger |
| Disk Space | 30 GB |
| Operating System | Amazon Linux 2, CentOS 7.9, RHEL 7 |
| sudo Access | Required during installation |
| Internet Access | Optional (preferred during the installation) |

AWS Gathr Setup

This section covers the steps to create the AWS resources that are essential to support the Gathr application ecosystem.

Prerequisites

The following prerequisites are required for setting up Gathr:

- VPC to launch AWS resources into a virtual network that you have defined.

- Subnets (Public and Private).

- NAT Gateway (Internet access for Private Subnet).

- Elastic IP (only if Gathr Webstudio is to be accessed publicly).

Create VPC and Subnets

Creating a VPC is required only if you do not plan to launch the Gathr instance in an existing VPC.

Note: If you have already created a VPC and subnets, you can skip this section.

Even if you do not create a new VPC, make sure that the existing VPC is set up as described below.

Steps to Create VPC

1. Click the Services drop-down and search for VPC.

Picture1

2. Click Start VPC Wizard and select VPC with Public and Private Subnets.

Picture2

3. Make sure that the Public and Private subnets are in the same Availability Zone:

- Public subnet: has Internet gateway access, for the Gathr web interface.

- Private subnet: for the Gathr application.

4. Create a new Elastic IP for the NAT Gateway.

Picture4

5. Click Create VPC.

Picture5

The Virtual Private Cloud is now created.

Picture6

To learn more about creating a VPC, subnets, and other VPC resources, see the reference below:

Create a VPC and Subnets

IAM Access

This section covers the IAM roles required to set up Gathr on AWS.

Setup IAM User

An IAM user is required to create an EC2 instance, Security group, VPC, Subnets, S3 bucket, Instance Profile etc.

A user with an AWS root user account has all the access that is necessary to launch Gathr on AWS. Otherwise, you can create an IAM User with the JSON policy.

Setup Role for EMR

You need to create three IAM roles "EMR_AutoScaling_DefaultRole", "EMR_DefaultRole", "EMR_EC2_DefaultRole". These roles will be available as configuration values when you are creating an EMR cluster in Gathr Webstudio.

There are two ways of creating the EMR roles. These are explained below:

- Create an EMR cluster, which in turn creates the required EMR roles.

If you have never created an EMR cluster, create one in the AWS console. This creates the necessary IAM roles in your AWS account.

- Create the EMR roles manually.

1. Create IAM Role: "EMR_AutoScaling_DefaultRole" and add the policies to it as shown in the screenshot below:

EMR_AutoScaling_DefaultRole

Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "application-autoscaling.amazonaws.com",
          "elasticmapreduce.amazonaws.com",
          "ec2.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

2. Create IAM Role: "EMR_DefaultRole" and add the given policies to it as shown in the screenshot below:

EMR_DefaultRole

Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

3. Create IAM Role: "EMR_EC2_DefaultRole" and add the given policies to it as shown in the screenshot below:

EMR_EC2DefaultRole

Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
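
The three trust relationships above differ only in their service principals. As a rough sketch, assuming the AWS CLI is installed and configured, the script below renders such a trust policy and prints the aws iam create-role call that would create each role with it. The helper names (trust_policy, create_role, run) and the DRY_RUN switch are illustrative and not part of Gathr. The permission policies shown in the screenshots still have to be attached separately.

```shell
#!/bin/sh
set -eu

# Print a trust policy that lets the given service principals assume a role.
trust_policy() (
  principals=""
  for svc in "$@"; do
    principals="${principals:+$principals, }\"$svc\""
  done
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": [ $principals ] },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
)

# Only echo the command by default; set DRY_RUN=0 to really call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Write the trust policy to a temporary file and create the role with it.
create_role() (
  role="$1"; shift
  file="${TMPDIR:-/tmp}/$role-trust.json"
  trust_policy "$@" > "$file"
  run aws iam create-role --role-name "$role" \
    --assume-role-policy-document "file://$file"
)

create_role EMR_DefaultRole elasticmapreduce.amazonaws.com
create_role EMR_EC2_DefaultRole ec2.amazonaws.com
create_role EMR_AutoScaling_DefaultRole \
  application-autoscaling.amazonaws.com elasticmapreduce.amazonaws.com ec2.amazonaws.com
```

Because create-role takes the trust policy as a required argument, creating the role this way sets the trust relationship in the same step.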

Setup Role for Gathr Webstudio EC2

Create IAM Role "GathrWebstudio_EC2Role" and add the following inline JSON Policy to it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "kms:ListKeyPolicies",
        "kms:ListRetirableGrants",
        "kms:ListAliases",
        "kms:ListGrants",
        "iam:GetPolicyVersion",
        "iam:GetPolicy",
        "s3:ListAllMyBuckets",
        "iam:ListRoles",
        "sts:AssumeRole",
        "elasticmapreduce:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "iam:PassRole",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_EC2_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_AutoScaling_DefaultRole",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>/*",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>"
      ]
    }
  ]
}

Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
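
The inline policy for this role contains two placeholders, <AWS_Account_ID> and <S3_Metadata_Bucket_Name>, that must be replaced before the policy is usable. The following sketch, assuming the AWS CLI is available, fills them in with sed and prints the IAM calls that would attach the policy and create an instance profile for the role. The file names, the policy name GathrWebstudioInline, and the example account ID and bucket name are hypothetical.

```shell
#!/bin/sh
set -eu

# Substitute the two placeholders used in the inline policy (stdin -> stdout).
fill_placeholders() (
  account_id="$1"
  bucket="$2"
  sed -e "s/<AWS_Account_ID>/$account_id/g" \
      -e "s/<S3_Metadata_Bucket_Name>/$bucket/g"
)

# Only echo the command by default; set DRY_RUN=0 to really call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Example values (hypothetical); replace with your own.
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"
BUCKET="${BUCKET:-my-gathr-metadata}"
ROLE="GathrWebstudio_EC2Role"

# Quick demonstration on a single policy line.
echo '"arn:aws:s3:::<S3_Metadata_Bucket_Name>/*"' | fill_placeholders "$ACCOUNT_ID" "$BUCKET"

# With the policy saved as policy-template.json (assumed file name):
#   fill_placeholders "$ACCOUNT_ID" "$BUCKET" < policy-template.json > policy.json
run aws iam put-role-policy --role-name "$ROLE" \
  --policy-name GathrWebstudioInline --policy-document file://policy.json
run aws iam create-instance-profile --instance-profile-name "$ROLE"
run aws iam add-role-to-instance-profile --instance-profile-name "$ROLE" --role-name "$ROLE"
```

The instance profile steps matter only when the role is created outside the console: the console creates a profile with the same name automatically, whereas the CLI does not.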

Security Groups

The following ports must be opened in the VM Security Group:

| Mandatory/Optional | Service | Port |
| --- | --- | --- |
| Optional | Zookeeper | 2181 |
| Mandatory | Gathr (Non-SSL/SSL) | 8090/8443 |
| Mandatory | SSH | 22 |
| Optional | RabbitMQ (Non-SSL/SSL) | 5672,15672/15671 |
| Optional | Elasticsearch | 9200-9300 |
| Optional | PostgreSQL | 5432 |

Create the following security groups:

1. SAX-WebServerSecurityGroup with the following permissions:

Inbound permission:

Outbound permission:

Picture8

2. SAX-SAXEMR-SecurityGroup with the following permissions:

Inbound permission:

Picture9

The source is the same security group, "SAX-SAXEMR-SecurityGroup".

Outbound permission:

Picture10
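
The ports listed in this section can also be opened from the command line. The sketch below, assuming the AWS CLI is configured, prints one aws ec2 authorize-security-group-ingress call per port. The security group ID and source CIDR are hypothetical placeholders, and the run helper with its DRY_RUN switch is illustrative only.

```shell
#!/bin/sh
set -eu

SG_ID="${SG_ID:-sg-0123456789abcdef0}"     # hypothetical security group ID
SOURCE_CIDR="${SOURCE_CIDR:-10.0.0.0/16}"  # assumed source range; narrow as needed

# Only echo the command by default; set DRY_RUN=0 to really call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Open each given TCP port (or a range such as 9200-9300) on the group.
open_ports() {
  for port in "$@"; do
    run aws ec2 authorize-security-group-ingress \
      --group-id "$SG_ID" --protocol tcp --port "$port" --cidr "$SOURCE_CIDR"
  done
}

# Mandatory ports: Gathr (non-SSL/SSL) and SSH.
open_ports 8090 8443 22

# Optional ports; uncomment the services that run on this node.
# open_ports 2181                 # Zookeeper
# open_ports 5672 15672 15671     # RabbitMQ
# open_ports 9200-9300            # Elasticsearch
# open_ports 5432                 # PostgreSQL
```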

S3 Bucket

Create an S3 bucket for Gathr metadata (referred to as sax-metadata in this guide). Name it according to your organization’s naming standards, and create it in the same region in which the Gathr EC2 node is launched.

Note: The S3 bucket name is required while configuring Gathr.
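
S3 bucket names must be lowercase, so it can help to check the chosen name before creating the bucket. The following sketch validates a name against the basic S3 naming rules and then prints the aws s3 mb call that would create it. The bucket name and region are example values, and the helper names are illustrative.

```shell
#!/bin/sh
set -eu

# Basic S3 bucket-name check: 3-63 characters, lowercase letters, digits,
# dots and hyphens only, starting and ending with a letter or digit.
# (S3 enforces a few further rules that are not checked here.)
valid_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# Only echo the command by default; set DRY_RUN=0 to really call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

BUCKET="${BUCKET:-sax-metadata}"   # example name
REGION="${REGION:-us-west-2}"      # example region; match the Gathr EC2 node

if valid_bucket_name "$BUCKET"; then
  run aws s3 mb "s3://$BUCKET" --region "$REGION"
else
  echo "Invalid bucket name: $BUCKET" >&2
fi
```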

Setup Databricks (If using AWS-Databricks)

To run Gathr jobs on Databricks, the user must have a Databricks Enterprise account subscription. Launch the EC2 instance in the Databricks VPC and the Databricks public subnet; otherwise, you must set up peering between the Gathr instance VPC and the Databricks VPC.

Launch EC2 Instance for Gathr Webstudio

This EC2 instance hosts all the services that the Gathr application requires to run.

To launch the EC2 instance, do as follows:

Choose an AMI

Picture11

Note: Select an AMI for one of the supported operating systems: Amazon Linux 2, CentOS 7.9, or RHEL 7.

Choose Instance Type

Select instance type m5.2xlarge or larger.

Picture12

Configure Instance

VPC: Select a pre-created VPC from the drop-down.

Subnet: Select a pre-created subnet from the drop-down.

Auto-assign IP: Enable.

IAM role: Select "GathrWebstudio_EC2Role", which you created earlier.

Picture13

Click Next on Network Interface.

On 'Add Storage', provide 100 GB of storage.

Picture14

On 'Add Tags', provide a Name for the EC2 instance.

Picture15

On the 'Configure Security Group' page, select the previously created security groups, 'SAX-WebServerSecurityGroup' and 'SAX-SAXEMR-SecurityGroup'.

Picture16

Review the settings and launch the instance, providing the key pair (PEM file).
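
For reference, the console choices above map roughly onto a single aws ec2 run-instances call. The sketch below only prints that call. Every ID, the key pair name, and the Name tag value are hypothetical placeholders. It also assumes that the instance profile has the same name as the role and that the AMI's root device is /dev/xvda (typical for Amazon Linux 2; other AMIs may differ).

```shell
#!/bin/sh
set -eu

# Only echo the command by default; set DRY_RUN=0 to really call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# All IDs below are hypothetical placeholders; substitute your own.
AMI_ID="${AMI_ID:-ami-0123456789abcdef0}"
SUBNET_ID="${SUBNET_ID:-subnet-0123456789abcdef0}"
WEB_SG="${WEB_SG:-sg-0aaaaaaaaaaaaaaaa}"    # SAX-WebServerSecurityGroup
EMR_SG="${EMR_SG:-sg-0bbbbbbbbbbbbbbbb}"    # SAX-SAXEMR-SecurityGroup
KEY_NAME="${KEY_NAME:-my-key-pair}"

launch_webstudio() {
  run aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type m5.2xlarge \
    --subnet-id "$SUBNET_ID" \
    --associate-public-ip-address \
    --security-group-ids "$WEB_SG" "$EMR_SG" \
    --iam-instance-profile Name=GathrWebstudio_EC2Role \
    --key-name "$KEY_NAME" \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=100}' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=gathr-webstudio}]'
}

launch_webstudio
```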

Associate Elastic IP address (Optional)

Select 'eth0' as the network interface and select the private IP of the instance.

Once the EC2 instance is up and running, continue with the Install Software section to start setting up Gathr.

Install Software

This section describes the steps to install the prerequisite software on the virtual machine launched in the cloud.

Note: Make sure that you have obtained the Gathr installation bundle from the Gathr support team.

SSH into the Gathr VM to continue with the following steps:

Install Java 8

1. Install Java 8:

yum install java-1.8.0-openjdk

yum install java-1.8.0-openjdk-devel

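2. Set JAVA_HOME in .bashrc.

Get the Java home path by running the following command:

alternatives --config java

The command lists the installed Java binary, for example:

/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64/jre/bin/java

Add the following lines to .bashrc, using that path without the trailing /jre/bin/java:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64

export PATH=$JAVA_HOME/bin:$PATH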
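
The JAVA_HOME value is the path of the java binary with the trailing /bin/java (and /jre, if present) removed. As a small sketch, the helper below derives it automatically and prints the two export lines. The function name java_home_from_binary is illustrative, and readlink -f is assumed to be available, as it is on the listed Linux distributions.

```shell
#!/bin/sh
set -eu

# Strip the trailing /bin/java and an optional /jre to get the Java home.
java_home_from_binary() (
  path="$1"
  path="${path%/bin/java}"
  path="${path%/jre}"
  printf '%s\n' "$path"
)

# If java is installed, resolve its real location and print the lines
# that can be appended to .bashrc.
if command -v java >/dev/null 2>&1; then
  java_bin="$(readlink -f "$(command -v java)")"
  JAVA_HOME="$(java_home_from_binary "$java_bin")"
  echo "export JAVA_HOME=$JAVA_HOME"
  echo 'export PATH=$JAVA_HOME/bin:$PATH'
fi
```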

Install RabbitMQ

This is an optional component. However, it is important for pipeline error handling in Gathr.

Install these packages before installing RabbitMQ:

yum -y install epel-release

yum -y install erlang socat

1. Download the package:

wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.10/rabbitmq-server-3.6.10-1.el7.noarch.rpm

rpm --import https://www.rabbitmq.com/rabbitmq-release-signing-key.asc

rpm -Uvh rabbitmq-server-3.6.10-1.el7.noarch.rpm

2. Start RabbitMQ:

systemctl start rabbitmq-server

3. Enable it to start at boot:

systemctl enable rabbitmq-server

4. Check the status:

systemctl status rabbitmq-server

5. Enable the management plugin:

sudo rabbitmq-plugins enable rabbitmq_management

6. Create a test user with administrator rights:

rabbitmqctl add_user test test

rabbitmqctl set_user_tags test administrator

rabbitmqctl set_permissions -p / test ".*" ".*" ".*"

Install Zookeeper

Install Zookeeper 3.5.7 as follows:

1. Copy the Zookeeper tar file from sax_bundle.

2. Extract it: tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz

3. Create a datadir directory inside the Zookeeper installation directory.

4. Go to /zookeeper-3.5.7/conf, copy zoo_sample.cfg to zoo.cfg (cp zoo_sample.cfg zoo.cfg), and edit zoo.cfg.

5. In zoo.cfg, set dataDir to the directory created in step 3 and set the IP and port: server.1=IP:2888:3888

6. Start Zookeeper from /zookeeper-3.5.7/bin with ./zkServer.sh start
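
As an illustration of steps 4 and 5, the sketch below prints a single-node zoo.cfg. The tickTime, initLimit, syncLimit, and clientPort values are the defaults shipped in zoo_sample.cfg. The data directory and IP address passed at the bottom are hypothetical, and render_zoo_cfg is an illustrative helper name.

```shell
#!/bin/sh
set -eu

# Print a single-node zoo.cfg for the given data directory and server IP.
render_zoo_cfg() (
  data_dir="$1"
  ip="$2"
  cat <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=$data_dir
clientPort=2181
server.1=$ip:2888:3888
EOF
)

# Hypothetical values: adjust the data directory and the private IP.
render_zoo_cfg /zookeeper-3.5.7/datadir 10.0.1.10
```

Redirecting the output to conf/zoo.cfg (for example, render_zoo_cfg ... > zoo.cfg) produces the file that zkServer.sh reads on start.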

Install Postgres 10

Install Postgres 10 as follows:

1. Install Postgres repo as a root user into the system:

rpm -Uvh https://yum.postgresql.org/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm

2. Install Postgresql10:

yum install postgresql10-server postgresql10

3. Initialize PGDATA:

/usr/pgsql-10/bin/postgresql-10-setup initdb

Then start PostgreSQL:

systemctl start postgresql-10.service

4. Log in to PostgreSQL:

su - postgres -c "psql"

5. Change the postgres password:

postgres=# \password postgres

Settings for PostgreSQL

1. Log in as the postgres user:

su - postgres

Change to the data directory (cd 10/data, that is, /var/lib/pgsql/10/data with the default package layout) and edit pg_hba.conf.

2. Add the following entries under the IPv4 section to allow connections:

host all all 0.0.0.0/0 md5

host replication postgres 10.1.2.0/24 md5

3. Edit postgresql.conf and change listen_addresses from localhost to *:

listen_addresses = '*'

4. Restart PostgreSQL:

systemctl restart postgresql-10.service
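
When these edits are scripted, it is convenient to add each pg_hba.conf entry only if it is not already present, so that repeated runs do not create duplicates. The following sketch shows one way to do that. The helper name append_once is illustrative, and the demonstration writes to a scratch file unless HBA_FILE points at the real pg_hba.conf.

```shell
#!/bin/sh
set -eu

# Append a line to a file only if that exact line is not already present.
append_once() (
  file="$1"
  line="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
)

# Demonstration on a scratch file; set HBA_FILE to the real pg_hba.conf
# (and run as the postgres user) once the entries have been reviewed.
HBA_FILE="${HBA_FILE:-$(mktemp)}"
append_once "$HBA_FILE" "host all all 0.0.0.0/0 md5"
append_once "$HBA_FILE" "host replication postgres 10.1.2.0/24 md5"
append_once "$HBA_FILE" "host all all 0.0.0.0/0 md5"   # already present, not added again
cat "$HBA_FILE"
```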

Install ElasticSearch 6.4.1

This is an optional component. However, it is important because it is used for monitoring Gathr pipelines.

Install ElasticSearch 6.4.1 as follows:

1. Copy Elasticsearch from Gathr_bundle.

2. Extract the bundle.

3. Open /elasticsearch-6.4.1/config/elasticsearch.yml and make these changes:

cluster.name: ES641

node.name: <IP of the machine>

path.data: /tmp/data

path.logs: /tmp/logs

network.host: <IP of the machine>

http.port: 9200

discovery.zen.ping.unicast.hosts: ["<IP of the machine>"]

Settings for Elasticsearch

Note: Increase the maximum number of file descriptors for the Elasticsearch process from the default [4096] to at least [65536]. Add the lines given below to /etc/security/limits.conf:

sax soft nofile 65536

sax hard nofile 65536

sax - memlock unlimited

Note: sax is the user that starts Elasticsearch.

Note: Increase the maximum virtual memory areas, vm.max_map_count, from [65530] to at least [262144] by running the following command:

sudo sysctl -w vm.max_map_count=262144

4. Start Elasticsearch in the background from the bin directory of the Elasticsearch installation:

nohup ./elasticsearch &
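
Before starting Elasticsearch, the two limits described above can be checked with a short script. This is only a sketch: the helper names at_least and report are illustrative, and the script assumes a Linux host where vm.max_map_count is exposed under /proc/sys.

```shell
#!/bin/sh
set -eu

# Succeeds when the actual value meets the required minimum.
# "unlimited" (as reported by ulimit) always passes.
at_least() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

# Print one status line for a named setting.
report() {
  if at_least "$2" "$3"; then
    echo "OK   $1 = $2 (minimum $3)"
  else
    echo "LOW  $1 = $2 (minimum $3)"
  fi
}

# Run as the user that starts Elasticsearch (sax in this guide).
report "max open files" "$(ulimit -n)" 65536
if [ -r /proc/sys/vm/max_map_count ]; then
  report "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count)" 262144
fi
```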

Install Gathr

Install and run in embedded mode

1. Copy the Gathr tar file from the Gathr_bundle to the Virtual Machine.

2. Extract the tar.

3. Run this command to start Gathr in embedded mode:

cd bin

./startServicesServer.sh -deployment.mode=embedded

Note: Logs are located in <GathrInstallationDir>/logs and <GathrInstallationDir>/server/tomcat/logs directories.

User can check the log files in these directories for any issues during installation process.

4. To open Gathr, browse to http://<Public_IP>:8090/Gathr

5. Accept the End User License Agreement and click Next.

Picture17

The Upload License page opens.

Picture18

6. Upload the license and confirm.

Picture19

7. Login page is displayed.

Picture20

Note: Gathr must now be switched from embedded mode to cluster mode.

Follow the steps below to switch Gathr from embedded to cluster mode.

Log in to Gathr using the default username and password.

Picture21

Navigate to the Setup >> Gathr and update the below details:

- Gathr Web URL

- Zookeeper Gathr Node

- Zookeeper Configuration Node

Picture22

Navigate to the Setup >> Database and update the below details:

- Connection URL

- User

- Password

- Run Script

Picture23

Note: Select the Run Script check box and click Save. This executes the DDL and DML scripts in the Gathr metastore.

Navigate to the Setup >> Messaging Queue and update the below details:

- Messaging Type

- Host List

- User

- Password

Picture24

Navigate to the Setup >> Elasticsearch and update the below details:

- Elasticsearch Connection URL

- Elasticsearch Cluster Name

Picture25

Zookeeper Configuration

Update the Zookeeper properties in Gathr Configuration with the below mentioned path:

<Gathr_install_dir>/conf/yaml/

Update Zookeeper property in the file “env-config.yaml”:

update_properties

Update the Zookeeper properties in Gathr Configuration with the below mentioned path:

<Gathr_install_dir>/conf/

Update Zookeeper property in the file “config.properties”:

update_property

After updating the details, restart Gathr with -config.reload=true.

Picture26

Cloud Vendor War

Copy Cloud Vendor specific war

Copy Cloud Vendor specific war file into tomcat.

For AWS

cp <Gathrinstallationlocation>/lib/emrservice.war <Gathrinstallationlocation>/server/tomcat/webapps/

For AWS-Databricks

cp <Gathrinstallationlocation>/lib/clusterMediator.war <Gathrinstallationlocation>/server/tomcat/webapps/

The war is extracted into server/tomcat/webapps. Now stop Tomcat and configure the application files.

For AWS

cd <Gathrinstallationlocation>/bin

./stopServicesServer.sh

Update the <Gathrinstallationlocation>/server/tomcat/webapps/emrservice/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME

spring.datasource.username=username

spring.datasource.password=password

spring.datasource.driver-class-name=org.postgresql.Driver

For AWS-Databricks

cd <Gathrinstallationlocation>/bin

./stopServicesServer.sh

Update the <Gathrinstallationlocation>/server/tomcat/webapps/cluster-mediator/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME

spring.datasource.username=username

spring.datasource.password=password

spring.datasource.driver-class-name=org.postgresql.Driver
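
The datasource properties above can also be set from a script instead of a manual edit. The sketch below replaces a key if it exists and appends it otherwise. The helper name set_property, the database name gathrdb, the user name, and the IP address are hypothetical, and the demonstration runs against a scratch file rather than the real application.properties.

```shell
#!/bin/sh
set -eu

# Set key=value in a properties file: replace the line if the key exists,
# otherwise append it. Values containing backslashes are not handled.
set_property() (
  file="$1"; key="$2"; value="$3"
  tmp="$(mktemp)"
  awk -F= -v k="$key" -v v="$value" '
    $1 == k { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp"
  cat "$tmp" > "$file"   # keep the original file's permissions
  rm -f "$tmp"
)

# Demonstration on a scratch file with hypothetical values.
PROPS="$(mktemp)"
printf '%s\n' 'spring.datasource.username=username' > "$PROPS"
set_property "$PROPS" spring.datasource.url "jdbc:postgresql://10.0.1.10:5432/gathrdb"
set_property "$PROPS" spring.datasource.username gathr
cat "$PROPS"
```

To apply it for real, call set_property with the path of the application.properties file for the relevant web application.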

Configure Cloud Vendor specific details in yaml

For AWS

Configure AWS details in YAML.

Open the env-config.yaml file (<GathrInstallationDir>/Gathr/conf/yaml/env-config.yaml) and append the content given below. Replace sax-metadata with the name of the S3 bucket created earlier.

emr:
  instance.url: "http://<GathrPrivateIP>:8090/emrservice"
  s3.jar.upoadPath: "s3://sax-metadata"
  s3.log.uri: "s3://sax-metadata"
  isEnabled: "true"
  region: "us-west-2"

Copy jar & init-scripts on S3

aws s3 cp <Gathrinstallationlocation>/lib/spark-structured-sax-pipeline.jar s3://sax-metadata/

aws s3 cp $SAX_BUNDLE/init-scripts.sh s3://sax-metadata/

For AWS Databricks

Note: The following steps are required only in case you need AWS Databricks cluster in Gathr application.

Configure Databricks details in yaml

File: (<Gathrinstallationlocation>/Gathr/conf/yaml/env-config.yaml)

databricks:
  dbfs.jar.uploadPath: "/sax-databricks-jars"
  mediator.address: "http://<GathrPrivateIP>:8090/cluster-mediator/"
  isEnabled: "true"
  authToken: "<authtoken>"
  instanceUrl: "https://<databricks-instance-url>"

Copy jar & init-scripts on DBFS

curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@<Gathrinstallationlocation>/lib/spark-structured-sax-pipeline.jar -F path="<sax metadata on dbfs path>/spark-structured-sax-pipeline.jar"

curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@$SAX_BUNDLE/init-scripts.sh -F path="<sax metadata on dbfs path>/init-scripts.sh"

Restart Gathr in Cluster Mode and upload license

cd <Gathrinstallationlocation>/bin

./startServicesServer.sh -config.reload=true

Basic Sanity

After logging in as the default user (superuser), verify the following:

1. Validate Default connections.

2. Validate Cluster List View and cluster creation.

Picture42

3. Validate the workspace and project creation.

4. Associate token with the created user.

Go to Manage Users and select the user with which the token is to be associated.

Picture43

Click the edit icon.

Picture44

Click the Next button.

Scroll down to the Databricks section, select the Token Associated check box, and choose Use Existing. Enter the Databricks account email as the Username, and enter the Token.

Picture45

After entering these details, click the Update button on the right.

Picture46

Now log out and log in as the workspace user you created. This user should now also see the Cluster List View.

5. Create sample pipeline.

Go to workspace -> project -> pipeline.

Start a local session and configure a basic pipeline (e.g. DG->RMQ). Save and exit.

6. Configure a job for the pipeline.

Select either an existing cluster or a new cluster on which to run this pipeline.

Picture47

Picture48

After the cluster launches, the pipeline moves to the STARTING state and then to the ACTIVE state.

To view logs, open the Databricks instance URL and look under Jobs:

Picture49

Select your pipeline name to view its logs.

After the pipeline finishes, check the data at the emitter.

Steps to Restart Gathr

If any configuration changes are required, restart Gathr by running the commands below from the <GathrInstallationDir>/bin directory, stopping the services first and then starting them again:

./stopServicesServer.sh

./startServicesServer.sh -config.reload=true

Steps to Uninstall Gathr

1. Stop/Kill the Bootstrap Process.

2. Delete the Gathr installation directory and its dependencies (such as RabbitMQ and Zookeeper).

3. Delete the Gathr database.