Setup Gathr on AWS - Manual Deployment
This topic assists the user in installing Gathr in an AWS environment.
AWS Portal Access
To set up Gathr, the user should have sufficient privileges to create and manage resources (EC2 instances, VPCs, subnets, security groups, IAM roles, Gathr Webstudio) in AWS.
Note: Hadoop components are optional. Depending on your requirements, you may need to install the Hadoop services (HDFS, Hive, YARN, etc.) manually and update the Gathr configurations accordingly.
Hardware and Software Configurations
The table below lists the system requirements for the Gathr application:
Hardware/Software | Requirement |
---|---|
Machine Type | m5.2xlarge or bigger |
Disk Space | 30 GB |
Operating System | Amazon Linux 2, CentOS 7.9, RHEL 7 |
sudo Access | Required during installation |
Internet Access | Optional (preferred during the installation) |
This section covers the steps to create the AWS resources that are essential to support the Gathr application ecosystem.
The following prerequisites are required for setting up Gathr:
- VPC to launch AWS resources into a virtual network that you have defined.
- Subnets (Public and Private).
- NAT Gateway (Internet access for Private Subnet).
- Elastic IP (if Gathr Webstudio is to be accessed publicly).
VPC creation is required only if the user does not plan to launch this AMI in an existing VPC.
Note: If you have already created a VPC and subnet, you can skip this section.
Even if you do not create a VPC, make sure that the existing VPC has the setup described below.
Steps to Create VPC
1. Click the Services drop-down and search for VPC.
2. Click Start VPC Wizard and select VPC with Public and Private Subnets.
3. Make sure that the public and private subnets are in the same Availability Zone:
- Public subnet with internet gateway access, for the Gathr web interface.
- Private subnet for the Gathr application.
4. Create a new Elastic IP for the NAT Gateway.
5. Click Create VPC.
The Virtual Private Cloud is now created.
To know more about how to create a VPC, subnets, and other VPC resources, refer to the AWS VPC documentation.
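If you prefer the command line, the same layout can be sketched with the AWS CLI (a minimal sketch; CIDRs, region, and Availability Zone are illustrative placeholders, and route table configuration is omitted):

aws ec2 create-vpc --cidr-block 10.0.0.0/16
# Public subnet (internet gateway access for the Gathr web interface)
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.0.0/24 --availability-zone us-west-2a
# Private subnet (Gathr application), same Availability Zone
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.1.0/24 --availability-zone us-west-2a
# Elastic IP for the NAT gateway
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id <public-subnet-id> --allocation-id <eip-allocation-id>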
This section covers the details of the IAM roles required to set up Gathr on AWS.
An IAM user is required to create an EC2 instance, Security group, VPC, Subnets, S3 bucket, Instance Profile etc.
A user with an AWS root user account has all the access necessary to launch Gathr on AWS. Otherwise, you can create an IAM user with the required JSON policy.
You need to create three IAM roles: "EMR_AutoScaling_DefaultRole", "EMR_DefaultRole", and "EMR_EC2_DefaultRole". These roles will be available as configuration values when you create an EMR cluster in Gathr Webstudio.
There are two ways of creating the EMR roles. These are explained below:
- Create an EMR cluster, which in turn creates the required EMR roles.
If you have never created an EMR cluster, create one in the AWS console. It will create the necessary IAM roles in the user's AWS account (see the CLI sketch after this list).
- Create the EMR roles manually.
1. Create IAM role "EMR_AutoScaling_DefaultRole" and attach the required policies to it (for example, the AWS managed policy AmazonElasticMapReduceforAutoScalingRole).
Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "application-autoscaling.amazonaws.com", "elasticmapreduce.amazonaws.com", "ec2.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] } |
2. Create IAM role "EMR_DefaultRole" and attach the required policies to it (for example, the AWS managed policy AmazonElasticMapReduceRole).
Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:
{ "Version": "2008-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "Service": "elasticmapreduce.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } |
3. Create IAM role "EMR_EC2_DefaultRole" and attach the required policies to it (for example, the AWS managed policy AmazonElasticMapReduceforEC2Role).
Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } |
Setup Role for Gathr Webstudio EC2
Create IAM Role "GathrWebstudio_EC2Role" and add the following inline JSON Policy to it:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor1", "Effect": "Allow", "Action": [ "ec2:*", "kms:ListKeyPolicies", "kms:ListRetirableGrants", "kms:ListAliases", "kms:ListGrants", "iam:GetPolicyVersion", "iam:GetPolicy", "s3:ListAllMyBuckets", "iam:ListRoles", "sts:AssumeRole", "elasticmapreduce:*" ], "Resource": "*" }, { "Sid": "VisualEditor2", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "iam:PassRole", "s3:ListBucket", "s3:DeleteObject" ], "Resource": [ "arn:aws:iam::<AWS_Account_ID>:role/EMR_EC2_DefaultRole", "arn:aws:iam::<AWS_Account_ID>:role/EMR_DefaultRole", "arn:aws:iam::<AWS_Account_ID>:role/EMR_AutoScaling_DefaultRole", "arn:aws:s3:::<S3_Metadata_Bucket_Name>/*", "arn:aws:s3:::<S3_Metadata_Bucket_Name>" ] } ]} |
Next, update the 'Trust Relationship' of the above IAM Role with the content provided below:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } |
Below are the ports that must be opened in the VM security group:
Requirement | Service | Port(s) |
---|---|---|
Optional | Zookeeper | 2181 |
Mandatory | Gathr (Non-SSL/SSL) | 8090/8443 |
Mandatory | SSH | 22 |
Optional | RabbitMQ (Non-SSL/SSL) | 5672,15672/15671 |
Optional | Elasticsearch | 9200-9300 |
Optional | PostgreSQL | 5432 |
Create the following security groups:
1. SAX-WebServerSecurityGroup with the following permissions:
Inbound permission:
Outbound permission:
2. SAX-SAXEMR-SecurityGroup with the following permissions:
Inbound permission:
The source is the same security group, "SAX-SAXEMR-SecurityGroup".
Outbound permission:
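Since the exact rule tables are not reproduced here, the sketch below shows how the mandatory ports from the table above could be opened with the AWS CLI (group ID and admin CIDR are placeholders):

# Gathr web interface (non-SSL) on SAX-WebServerSecurityGroup
aws ec2 authorize-security-group-ingress --group-id <sg-webserver-id> --protocol tcp --port 8090 --cidr 0.0.0.0/0
# SSH, ideally restricted to an administrative CIDR
aws ec2 authorize-security-group-ingress --group-id <sg-webserver-id> --protocol tcp --port 22 --cidr <admin-cidr>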
Create S3 Bucket
The user needs to create a bucket for sax-metadata in their S3 account. Name it according to your organization's naming standards, and use the same region that is used to launch the Gathr EC2 node.
Note: The S3 bucket name is required while configuring Gathr.
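A hedged CLI sketch for creating the bucket (bucket name and region are placeholders; in us-east-1 the LocationConstraint must be omitted):

aws s3api create-bucket --bucket <S3_Metadata_Bucket_Name> --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
# Verify
aws s3 ls s3://<S3_Metadata_Bucket_Name>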
Setup Databricks (If using AWS-Databricks)
For running Gathr jobs on Databricks, the user must have a Databricks Enterprise account subscription. Launch the EC2 instance using the Databricks VPC and Databricks public subnet; otherwise, the user will be required to set up peering between the Gathr instance VPC and the Databricks VPC.
Launch EC2 Instance for Gathr Webstudio
This EC2 instance will host all the services that are essential for the Gathr application to run successfully.
To launch the EC2 instance, do as follows:
Choose an AMI
Note: Select the AMI of any of the preferred operating systems: Amazon Linux 2, CentOS 7.9, RHEL 7.
Choose Instance Type
Select instance type m5.2xlarge or larger.
Configure Instance
VPC: Select the pre-created VPC from the drop-down.
Subnet: Select the pre-created subnet from the drop-down.
Auto-assign Public IP: Enable.
IAM role: Select "GathrWebstudio_EC2Role", which you created earlier.
Click Next on Network Interface.
On 'Add Storage', provide 100 GB of storage.
On 'Add Tags', provide a Name for the EC2 instance.
On 'Configure Security Group', select the previously created security groups, i.e., 'SAX-WebServerSecurityGroup' and 'SAX-SAXEMR-SecurityGroup'.
Review the settings and launch the instance by providing the PEM file.
Associate Elastic IP address (Optional)
Select 'eth0' as the network interface and select the private IP of the instance.
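For reference, a CLI equivalent of this launch (a sketch; all IDs are placeholders, and it assumes the GathrWebstudio_EC2Role instance profile created earlier):

aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type m5.2xlarge \
  --key-name <key-pair-name> \
  --subnet-id <subnet-id> \
  --security-group-ids <sg-webserver-id> <sg-emr-id> \
  --iam-instance-profile Name=GathrWebstudio_EC2Role \
  --associate-public-ip-address \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":100}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=GathrWebstudio}]'
# Optional: associate the Elastic IP with the instance.
aws ec2 associate-address --instance-id <instance-id> --allocation-id <eip-allocation-id>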
Once the EC2 instance is up and running, continue from Section 4 to start setting up Gathr.
This section describes the steps that the user should take to install the prerequisite software on the virtual machine that has been launched in the cloud.
Note: Make sure that you have obtained the Gathr installation bundle from the Gathr support team.
SSH into the Gathr VM to continue with the following steps:
1. Install Java 8:
yum install java-1.8.0-openjdk
yum install java-1.8.0-openjdk-devel
2. Set JAVA_HOME in .bashrc.
Get Java Home path by running the following command:
alternatives --config java
Example output:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64/jre/bin/java
Then set the variables:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
export PATH=$JAVA_HOME/bin:$PATH
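To make the setting persistent, the exports can be appended to ~/.bashrc and verified (path assumed from the step above):

echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
java -version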
Install RabbitMQ
This is an optional component; however, it is important for pipeline error handling in Gathr.
Install these packages before installing RabbitMQ:
yum -y install epel-release
yum -y install erlang socat
1. Download the package:
wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.10/rabbitmq-server-3.6.10-1.el7.noarch.rpm
rpm --import https://www.rabbitmq.com/rabbitmq-release-signing-key.asc
rpm -Uvh rabbitmq-server-3.6.10-1.el7.noarch.rpm
2. Start the server using the below command:
systemctl start rabbitmq-server
3. Enable it with the below command:
systemctl enable rabbitmq-server
4. To check the status, use the below command:
systemctl status rabbitmq-server
5. Enable the plugins with the below command:
sudo rabbitmq-plugins enable rabbitmq_management
6. To create a test user, run the below commands:
rabbitmqctl add_user test test
rabbitmqctl set_user_tags test administrator
rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
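A quick way to confirm the broker and the test user (assumes the management plugin enabled in step 5):

rabbitmqctl list_users
rabbitmqctl list_permissions -p /
# The management API on port 15672 should answer with broker details:
curl -u test:test http://localhost:15672/api/overview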
Install Zookeeper 3.5.7 as follows:
1. Copy the ZooKeeper tar file from the sax_bundle.
2. Extract it: tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz
3. Create a data directory inside the ZooKeeper installation directory.
4. Go to apache-zookeeper-3.5.7-bin/conf, copy zoo_sample.cfg to zoo.cfg (cp zoo_sample.cfg zoo.cfg), and edit zoo.cfg, pointing dataDir at the directory created in step 3.
5. Set the IP and port in the zoo.cfg file: server.1=IP:2888:3888
6. Start ZooKeeper from apache-zookeeper-3.5.7-bin/bin with ./zkServer.sh start.
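Consolidated, the steps above amount to the following sketch (paths assume the tar extracts to apache-zookeeper-3.5.7-bin; the IP is a placeholder):

tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz
cd apache-zookeeper-3.5.7-bin
mkdir datadir
cp conf/zoo_sample.cfg conf/zoo.cfg
# In conf/zoo.cfg set:
#   dataDir=<path-to>/apache-zookeeper-3.5.7-bin/datadir
#   server.1=<PrivateIP>:2888:3888
./bin/zkServer.sh start
./bin/zkServer.sh status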
Install Postgres 10 as follows:
1. As the root user, install the Postgres repo:
rpm -Uvh https://yum.postgresql.org/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
2. Install PostgreSQL 10:
yum install postgresql10-server postgresql10
3. Initialize PGDATA:
/usr/pgsql-10/bin/postgresql-10-setup initdb
Then start Postgres:
systemctl start postgresql-10.service
4. Log in to Postgres:
su - postgres -c "psql"
5. Change the postgres password:
postgres=# \password postgres
1. Log in as the postgres user, change to the data directory, and edit pg_hba.conf:
su - postgres
cd 10/data
2. Add the IPs under the IPv4 entries to allow access:
host all all 0.0.0.0/0 md5
host replication postgres 10.1.2.0/24 md5
3. Edit postgresql.conf and change listen_addresses from 'localhost' to '*':
listen_addresses = '*'
4. Restart Postgres:
systemctl restart postgresql-10.service
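To confirm that remote connections are accepted (assumes a psql client and the password set earlier):

psql -h <GathrPrivateIP> -U postgres -d postgres -c '\l'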
This is an optional component; however, it is important because it is used for monitoring Gathr pipelines.
Install Elasticsearch 6.4.1 as follows:
1. Copy Elasticsearch from Gathr_bundle.
2. Extract the bundle.
3. Open elasticsearch-6.4.1/config/elasticsearch.yml and make these changes:
cluster.name: ES641
node.name: <IP of the machine>
path.data: /tmp/data
path.logs: /tmp/logs
network.host: <IP of the machine>
http.port: 9200
discovery.zen.ping.unicast.hosts: ["<IP>"]
Settings for Elasticsearch
Note: Make sure to increase the max file descriptors [4096] for the Elasticsearch process to at least [65536] as follows:
Add the lines given below in /etc/security/limits.conf:
sax soft nofile 65536
sax hard nofile 65536
sax memlock unlimited
Note: 'sax' is the user that starts Elasticsearch.
Note: Make sure to increase the max virtual memory areas vm.max_map_count [65530] to at least [262144].
Run the following command:
sudo sysctl -w vm.max_map_count=262144
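To keep the setting across reboots, it can also be persisted (an optional sketch assuming root access):

echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p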
4. Start Elasticsearch in the background (from elasticsearch-6.4.1/bin):
nohup ./elasticsearch &
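A quick health check against the configured host and port (placeholder IP):

curl http://<IP>:9200/
# A JSON response with "cluster_name" : "ES641" indicates the node is up.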
Install and run in embedded mode
1. Copy the Gathr tar file from the Gathr_bundle to the Virtual Machine.
2. Extract the tar.
3. Run the below commands to start Gathr in embedded mode:
cd bin
./startServicesServer.sh -deployment.mode=embedded
Note: Logs are located in <GathrInstallationDir>/logs and <GathrInstallationDir>/server/tomcat/logs directories.
The user can check the log files in these directories for any issues during the installation process.
4. To open Gathr, browse to http://<Public_IP>:8090/Gathr.
5. Accept the End User License Agreement and click the Next button.
The Upload License page opens.
6. Upload the license and confirm.
7. The login page is displayed.
Note: Now the user will need to switch Gathr from embedded to cluster mode.
Follow the sections given below in order to switch Gathr from embedded to cluster mode.
Log in to Gathr using the default username and password.
Navigate to the Setup >> Gathr and update the below details:
- Gathr Web URL
- Zookeeper Gathr Node
- Zookeeper Configuration Node
Navigate to the Setup >> Database and update the below details:
- Connection URL
- User
- Password
- Run Script
Note: Check the Run Script option and click Save; this will execute the DDL and DML in the Gathr metastore.
Navigate to the Setup >> Messaging Queue and update the below details:
- Messaging Type
- Host List
- User
- Password
Navigate to the Setup >> Elasticsearch and update the below details:
- Elasticsearch Connection URL
- Elasticsearch Cluster Name
Update the Zookeeper properties in the Gathr configuration at the below path:
<Gathr_install_dir>/conf/yaml/
Update the Zookeeper property in the file "env-config.yaml".
Likewise, update the Zookeeper properties at the below path:
<Gathr_install_dir>/conf/
Update the Zookeeper property in the file "config.properties".
After updating the details, restart Gathr with -config.reload=true.
Copy Cloud Vendor specific war
Copy the cloud-vendor-specific war file into Tomcat.
For AWS
cp <Gathrinstallationlocation>/lib/emrservice.war <Gathrinstallationlocation>/server/tomcat/webapps/
For AWS-Databricks
cp <Gathrinstallationlocation>/lib/clusterMediator.war <Gathrinstallationlocation>/server/tomcat/webapps/
The war file will be extracted in server/tomcat/webapps. Now stop Tomcat and configure the application files.
For AWS
cd <Gathrinstallationlocation>/bin
./stopServicesServer.sh

Update the <Gathrinstallationlocation>/server/tomcat/webapps/emrservice/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME
spring.datasource.username=username
spring.datasource.password=password
spring.datasource.driver-class-name=org.postgresql.Driver
For AWS-Databricks
cd <Gathrinstallationlocation>/bin
./stopServicesServer.sh

Update the <Gathrinstallationlocation>/server/tomcat/webapps/cluster-mediator/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME
spring.datasource.username=username
spring.datasource.password=password
spring.datasource.driver-class-name=org.postgresql.Driver
Configure Cloud Vendor specific details in yaml
For AWS
Configure the AWS details in yaml.
Open the env-config.yaml file (<GathrInstallationDir>/Gathr/conf/yaml/env-config.yaml) and append the content given below:
emr:
  instance.url: "http://<GathrPrivateIP>:8090/emrservice"
  s3.jar.uploadPath: "s3://sax-metadata"
  s3.log.uri: "s3://sax-metadata"
  isEnabled: "true"
  region: "us-west-2"
Copy the pipeline jar and init script to the S3 metadata bucket:
aws s3 cp <Gathrinstallationlocation>/lib/spark-structured-sax-pipeline.jar s3://sax-metadata/
aws s3 cp $SAX_BUNDLE/init-scripts.sh s3://sax-metadata/
For AWS Databricks
Note: The following steps are required only in case you need AWS Databricks cluster in Gathr application.
Configure the Databricks details in yaml.
File: <Gathrinstallationlocation>/Gathr/conf/yaml/env-config.yaml
databricks:
  dbfs.jar.uploadPath: "/sax-databricks-jars"
  mediator.address: "http://<GathrPrivateIP>:8090/cluster-mediator/"
  isEnabled: "true"
  authToken: "<authtoken>"
  instanceUrl: "https://<databricks-instance-url>"
Copy jar & init-scripts on DBFS
curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@<Gathr installation location>/lib/spark-structured-sax-pipeline.jar -F path="<sax metadata on dbfs path>/spark-structured-sax-pipeline.jar"
curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@$SAX_BUNDLE/init-scripts.sh -F path="<sax metadata on dbfs path>/init-scripts.sh"
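The uploads can be verified with the DBFS list endpoint of the same API (same token and path placeholders as above):

curl 'https://<databricks-instance>/api/2.0/dbfs/list?path=<sax metadata on dbfs path>' -H "Authorization: Bearer <personal-access-token-value>"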
Restart Gathr in Cluster Mode and upload license
cd <Gathrinstallationlocation>/bin
./startServicesServer.sh -config.reload=true
After logging in with the default user (superuser), the below steps need to be checked:
1. Validate Default connections.
2. Validate Cluster List View and cluster creation.
3. Validate the workspace and project creation.
4. Associate token with the created user.
Go to Manage Users and select the user with which the token needs to be associated.
Click the edit icon.
Click the Next button.
Scroll down to the Databricks section, tick the Token Associated checkbox, and select Use Existing. Enter the Databricks account email as the username and the token in these boxes.
After entering these details, click the Update button on the right.
Now, log out and log in as the workspace user you created. The user should now also see the Cluster List View.
5. Create sample pipeline.
Go to workspace -> project -> pipeline.
Start a local session and configure a basic pipeline (e.g., DG -> RMQ). Save and exit.
6. Configure job for pipeline.
Select either an existing cluster or a new cluster on which to run this pipeline.
After the cluster launches, the pipeline will move to the STARTING state and then to the ACTIVE state.
For logs, check the Databricks instance URL under Jobs:
Select your pipeline name to check its logs.
After the pipeline finishes, check the data at the emitter.
In case there are any updates to be done in the configurations, you can restart Gathr by running the below commands:
./stopServicesServer.sh
./startServicesServer.sh -config.reload=true
Uninstall Gathr
1. Stop/kill the bootstrap process.
2. Delete the Gathr installation directory and its dependencies (like RabbitMQ, ZooKeeper, etc.).
3. Delete the Gathr database.