Setup Gathr on AWS - Manual Deployment

This topic assists the user in installing Gathr in an AWS environment.

AWS Portal Access

To set up Gathr, the user should have sufficient privileges to create and manage resources (virtual machines (EC2), virtual networks (VPC), subnets, security groups, Gathr Webstudio) in AWS.

Hardware and Software Configurations

The below table provides the system requirements for the Gathr application:

Hardware/Software    Requirement
-----------------    -----------
Machine Type         m5.2xlarge or bigger
Disk Space           30 GB
Operating System     Amazon Linux 2, CentOS 7.9, RHEL 7
sudo Access          Required during installation
Internet Access      Optional (preferred during the installation)

AWS Gathr Setup

This section covers the steps to create the resources in AWS that are essential to support the Gathr application ecosystem.

Prerequisites

The following prerequisites are required for setting up Gathr:

  • VPC to launch AWS resources into a virtual network that you have defined.

  • Subnets (Public and Private).

  • NAT Gateway (Internet access for Private Subnet).

  • Elastic IP (required if Gathr Webstudio is to be accessed publicly).

Create VPC and Subnets

VPC creation is required only if the user does not plan to launch this AMI in an existing VPC.

Even if you do not create a VPC, make sure that the existing VPC has the setup as described below.

Steps to Create VPC

  1. Click the Services drop-down and search for VPC.

    Picture1

  2. Click Start VPC Wizard and select VPC with Public and Private Subnets.

    Picture2

  3. Make sure that the Public and Private subnets are in the same Availability Zone.

    Picture3

    The Public Subnet has Internet gateway access for the Gathr web interface.

    The Private Subnet hosts the Gathr application.

  4. Create a new Elastic IP for the NAT Gateway.

    Picture4

  5. Click Create VPC.

    Picture5

    The Virtual Private Cloud is now created.

    Picture6

To know more about how to create a VPC, Subnets and other VPC resources, follow the topic, Create a VPC and Subnets.

IAM Access

This section covers details of the IAM roles required to set up Gathr on AWS.

Setup IAM User

An IAM user is required to create an EC2 instance, Security group, VPC, Subnets, S3 bucket, Instance Profile etc.

A user with an AWS root user account has all the access necessary to launch Gathr on AWS. Otherwise, you can create an IAM user with the required JSON policy.

Setup Role for EMR

You need to create three IAM roles: “EMR_AutoScaling_DefaultRole”, “EMR_DefaultRole”, and “EMR_EC2_DefaultRole”. These roles will be available as configuration values when you are creating an EMR cluster in Gathr Webstudio.

There are two ways of creating the EMR roles. These are explained below:

  • Create an EMR cluster, which in turn creates the required EMR roles.

If you have never created an EMR cluster, then create an EMR cluster in the AWS console. It will create the necessary IAM roles in the user’s AWS account.

  • Create the EMR roles manually.
  1. Create IAM Role: “EMR_AutoScaling_DefaultRole” and add the policies to it as shown in the screenshot below:

    EMR_AutoScaling_DefaultRole

    Next, update the ‘Trust Relationship’ of the above IAM Role with the content provided below:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "application-autoscaling.amazonaws.com",
              "elasticmapreduce.amazonaws.com",
              "ec2.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
  2. Create IAM Role: “EMR_DefaultRole” and add the given policies to it as shown in the screenshot below:

    EMR_DefaultRole

    Next, update the ‘Trust Relationship’ of the above IAM Role with the content provided below:

    {
      "Version": "2008-10-17",
      "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Principal": {
            "Service": "elasticmapreduce.amazonaws.com"
            },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
    
  3. Create IAM Role: “EMR_EC2_DefaultRole” and add the given policies to it as shown in the screenshot below:

    EMR_EC2DefaultRole

    Next, update the ‘Trust Relationship’ of the above IAM Role with the content provided below:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "ec2.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
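If you prefer the command line, the first option above (creating a cluster just to generate the roles) has a more direct CLI equivalent: `aws emr create-default-roles` creates EMR_DefaultRole and EMR_EC2_DefaultRole idempotently. To our knowledge it does not create EMR_AutoScaling_DefaultRole, which still needs one of the approaches above. A sketch (the helper-script name is illustrative; run it once the AWS CLI is configured with sufficient IAM privileges):

```shell
# Hypothetical helper script; run it after 'aws configure' with an IAM user
# that has the relevant iam:CreateRole / iam:GetRole permissions.
cat > create-emr-roles.sh <<'EOF'
#!/bin/sh
# Creates EMR_DefaultRole and EMR_EC2_DefaultRole if they do not already exist.
aws emr create-default-roles
# Verify the two roles now exist:
aws iam get-role --role-name EMR_DefaultRole --query 'Role.Arn' --output text
aws iam get-role --role-name EMR_EC2_DefaultRole --query 'Role.Arn' --output text
EOF
chmod +x create-emr-roles.sh
echo "create-emr-roles.sh written"
```

Run it with `./create-emr-roles.sh` once credentials are in place.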
    
    

Setup Role for Gathr Webstudio EC2

Create IAM Role “GathrWebstudio_EC2Role” and add the following inline JSON Policy to it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "kms:ListKeyPolicies",
        "kms:ListRetirableGrants",
        "kms:ListAliases",
        "kms:ListGrants",
        "iam:GetPolicyVersion",
        "iam:GetPolicy",
        "s3:ListAllMyBuckets",
        "iam:ListRoles",
        "sts:AssumeRole",
        "elasticmapreduce:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "iam:PassRole",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_EC2_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_AutoScaling_DefaultRole",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>/*",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>"
      ]
    }
  ]
}

Next, update the ‘Trust Relationship’ of the above IAM Role with the content provided below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
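The same role can also be created from the CLI. The sketch below (file names are illustrative, and `gathr-policy.json` is assumed to hold the inline policy shown earlier) writes the trust policy to a file, validates it locally, and leaves the `aws iam` calls commented out for when the CLI is configured:

```shell
# Sketch: write the trust policy shown above to a file and validate it locally.
cat > gathr-ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Validate the JSON before handing it to AWS (skipped if python3 is absent):
command -v python3 >/dev/null 2>&1 \
    && python3 -m json.tool gathr-ec2-trust.json > /dev/null \
    && echo "trust policy OK"

# With credentials configured, run (gathr-policy.json = inline policy above):
# aws iam create-role --role-name GathrWebstudio_EC2Role \
#     --assume-role-policy-document file://gathr-ec2-trust.json
# aws iam put-role-policy --role-name GathrWebstudio_EC2Role \
#     --policy-name GathrWebstudioPolicy --policy-document file://gathr-policy.json
# aws iam create-instance-profile --instance-profile-name GathrWebstudio_EC2Role
# aws iam add-role-to-instance-profile \
#     --instance-profile-name GathrWebstudio_EC2Role \
#     --role-name GathrWebstudio_EC2Role
```

The instance profile wrapping is what lets the role be attached to the EC2 instance later in this guide.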

Security Groups

Below are the ports required to be opened in VM Security Group:

Mandatory/Optional    Service                   Port
------------------    -------                   ----
Optional              Zookeeper                 2181
Mandatory             Gathr (Non-SSL/SSL)       8090/8443
Mandatory             SSH                       22
Optional              RabbitMQ (Non-SSL/SSL)    5672, 15672/15671
Optional              Elasticsearch             9200-9300
Optional              PostgreSQL                5432
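As an illustrative sketch, the mandatory ports above can also be opened with `aws ec2 authorize-security-group-ingress`. The security-group ID and CIDR are placeholders supplied at run time; the script is only generated here, not executed:

```shell
# Sketch: generate a helper that opens the mandatory Gathr ports with the AWS
# CLI. SG_ID and CIDR are placeholders; 22, 8090 and 8443 are the mandatory
# ports from the table above.
cat > open-gathr-ports.sh <<'EOF'
#!/bin/sh
SG_ID="$1"    # e.g. sg-0123456789abcdef0
CIDR="$2"     # e.g. 10.0.0.0/16
for PORT in 22 8090 8443; do
    aws ec2 authorize-security-group-ingress \
        --group-id "$SG_ID" --protocol tcp --port "$PORT" --cidr "$CIDR"
done
EOF
chmod +x open-gathr-ports.sh
echo "open-gathr-ports.sh written"
```

Run it as, for example, `./open-gathr-ports.sh sg-0123456789abcdef0 10.0.0.0/16` once the AWS CLI is configured.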

Create the following security groups:

  1. SAX-WebServerSecurityGroup with the following permissions:

    Inbound permission:

    Picture7

    Outbound permission:

    Picture8

  2. SAX-SAXEMR-SecurityGroup with the following permissions:

    Inbound permission:

    Picture9

    The source is the same security group, “SAX-SAXEMR-SecurityGroup”.

    Outbound permission:

    Picture10

S3 Bucket

Create a bucket for sax-metadata in S3. Name it according to your organization’s naming standards, and use the same region that is used to launch the Gathr EC2 node.
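A minimal sketch of creating the bucket with the AWS CLI; the bucket name and region below are examples, not values mandated by this guide:

```shell
# Sketch: create the sax-metadata bucket in the same region as the Gathr EC2
# node. Bucket name and region are illustrative placeholders.
cat > create-gathr-bucket.sh <<'EOF'
#!/bin/sh
BUCKET="my-org-sax-metadata"   # must be globally unique and lowercase
REGION="us-west-2"             # use the region of the Gathr EC2 node
aws s3 mb "s3://${BUCKET}" --region "${REGION}"
aws s3 ls "s3://${BUCKET}"     # verify the bucket is reachable
EOF
chmod +x create-gathr-bucket.sh
echo "create-gathr-bucket.sh written"
```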

Setup Databricks (If using AWS-Databricks)

For running Gathr jobs on Databricks, the user must have a Databricks Enterprise account subscription. Launch the EC2 instance in the Databricks VPC and Databricks public subnet; otherwise, peering must be set up between the Gathr instance VPC and the Databricks VPC.

Launch EC2 Instance for Gathr Webstudio

This EC2 instance will have all the required services that are essential for Gathr application to run successfully.

To launch the EC2 Instance do as follows:

Choose an AMI

Picture11

Choose Instance Type

Select instance type m5.2xlarge or larger.

Picture12

Configure Instance

VPC: Select a pre-created VPC from drop down.

Subnet: Select pre-created subnet from drop down.

Auto-assign IP: enable

IAM role: Select “GathrWebstudio_EC2Role” which you have created earlier.

Picture13

Click next on Network Interface.

On ‘Add Storage’ provide 100 GB storage.

Picture14

On ‘Add Tags’ provide Name to the EC2 instance.

Picture15

On the ‘Configure Security Group’ page, select the previously created security groups, i.e. ‘SAX-WebServerSecurityGroup’ and ‘SAX-SAXEMR-SecurityGroup’.

Picture16

Review the settings and launch the instance, providing the PEM (key pair) file.

Associate Elastic IP address (Optional)

Select ‘eth0’ as the network interface and select the Private IP of the instance.

Once the EC2 instance is up and running, continue from Section 4 to start setting up Gathr.

Install Software

This section describes the steps the user should take to install the prerequisite software on the virtual machine launched in the cloud.

SSH into the Gathr VM to continue with the following steps.

Install Java 8

  1. Install Java 8:

    yum install java-1.8.0-openjdk
    yum install java-1.8.0-openjdk-devel
    
  2. Set JAVA_HOME in .bashrc

    Get Java Home path by running the following command:

    alternatives --config java
    
    # Example path reported by 'alternatives' (the build suffix varies):
    # /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64/jre/bin/java
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
    export PATH=$JAVA_HOME/bin:$PATH
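After sourcing `~/.bashrc`, a quick sanity check (a sketch; it only reports what is currently resolved, so "unset"/"not on PATH" means the exports have not taken effect yet):

```shell
# Print the resolved JAVA_HOME and, if java is on PATH, its version line.
echo "JAVA_HOME=${JAVA_HOME:-unset}"
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
else
    echo "java not on PATH"
fi
```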
    

Install RabbitMQ

This is an optional component. However, it is important for pipeline error handling in Gathr.

Install these packages before installing RabbitMQ:

yum -y install epel-release
yum -y install erlang socat
  1. Download the package:

    wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.10/rabbitmq-server-3.6.10-1.el7.noarch.rpm
    rpm --import https://www.rabbitmq.com/rabbitmq-release-signing-key.asc
    rpm -Uvh rabbitmq-server-3.6.10-1.el7.noarch.rpm
    
  2. Start using the below command:

    systemctl start rabbitmq-server
    
  3. Enable it with the below command:

    systemctl enable rabbitmq-server
    
  4. To check the status, use the below command:

    systemctl status rabbitmq-server
    
  5. Enable the plugins with the below command:

    sudo rabbitmq-plugins enable rabbitmq_management
    
  6. To create a test user, provide the below command:

    rabbitmqctl add_user test test
    rabbitmqctl set_user_tags test administrator
    rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
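To confirm the broker and the test user are working, the management API enabled in step 5 can be probed (a sketch; assumes a local broker on the default management port 15672 and the `test`/`test` credentials created above):

```shell
# Probe the RabbitMQ management API as the 'test' user.
if ! command -v curl >/dev/null 2>&1; then
    echo "curl not available"
elif curl -s -u test:test http://localhost:15672/api/whoami | grep -q test; then
    echo "management API reachable as user 'test'"
else
    echo "management API not reachable (is rabbitmq-server running?)"
fi
```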
    

Install Zookeeper

Install Zookeeper 3.5.7 as follows:

  1. Copy the Zookeeper tar file from the sax_bundle.

  2. Extract it: tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz

  3. Create a data directory (dataDir) inside the Zookeeper installation directory.

  4. Open /zookeeper-3.5.7/conf, run cp zoo_sample.cfg zoo.cfg, and edit zoo.cfg.

  5. Set the IP and Port in zoo.cfg file: server.1=IP:2888:3888.

  6. Start the zookeeper from /zookeeper-3.5.7/bin with ./zkServer.sh start.
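Once started, ZooKeeper health can be checked with the `ruok` four-letter command (a sketch; note that in ZooKeeper 3.5.x four-letter words may first need `4lw.commands.whitelist=ruok` in zoo.cfg, and the client port 2181 is assumed from the table earlier):

```shell
# Send 'ruok' to the local ZooKeeper; a healthy server replies 'imok'.
if command -v nc >/dev/null 2>&1; then
    REPLY=$(echo ruok | nc -w 2 localhost 2181 2>/dev/null)
    echo "${REPLY:-no reply (is ZooKeeper running?)}"
else
    echo "nc not available"
fi
```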

Install Postgres 10

Install Postgres 10 as follows:

  1. Install Postgres repo as a root user into the system:

    rpm -Uvh https://yum.postgresql.org/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
    
  2. Install Postgresql10:

    yum install postgresql10-server postgresql10
    
  3. Initialize PGDATA and start Postgres:

    /usr/pgsql-10/bin/postgresql-10-setup initdb
    systemctl start postgresql-10.service
    
  4. Login into Postgres:

    su - postgres -c "psql"
    
  5. Change the Postgres password:

    postgres=# \password postgres
    

Settings for PostgreSQL

  1. Login as the postgres user and open pg_hba.conf for editing:

    su - postgres
    cd ~/10/data        # PGDATA, typically /var/lib/pgsql/10/data
    vi pg_hba.conf
    
  2. Add the following entries under the IPv4 section to allow access:

    host all all 0.0.0.0/0 md5
    host replication postgres 10.1.2.0/24 md5
    
  3. Edit postgresql.conf (in the same directory) and change listen_addresses from 'localhost' to '*':

    listen_addresses = '*'
    
  4. Restart the Postgres:

    systemctl restart postgresql-10.service
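After the restart, password (md5) logins can be spot-checked (a sketch; the host, user, and password below are the values configured in the earlier steps and should be adjusted to your environment):

```shell
# Attempt an md5-authenticated connection to the local Postgres.
if command -v psql >/dev/null 2>&1; then
    PGPASSWORD=postgres psql -h 127.0.0.1 -U postgres -d postgres \
        -c "SELECT version();" || echo "connection failed"
else
    echo "psql not on PATH"
fi
```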
    

Install ElasticSearch 6.4.1

This is an optional component. However, it is important as it is used for monitoring Gathr pipelines.

Install ElasticSearch 6.4.1 as follows:

  1. Copy Elasticsearch from Gathr_bundle.

  2. Extract the bundle.

  3. Open /elasticsearch-6.4.1/config/elasticsearch.yml and make these changes:

    cluster.name: ES641
    node.name: IP of the machine
    path.data: /tmp/data
    path.logs: /tmp/logs
    network.host: IP of the machine
    http.port: 9200
    discovery.zen.ping.unicast.hosts: ["IP"]
    

    Settings for Elasticsearch

    Add the lines given below in /etc/security/limits.conf (here ‘sax’ is the user that runs Elasticsearch):

    sax soft nofile 65536
    sax hard nofile 65536
    sax memlock unlimited
    

    Run the following command:

    sudo sysctl -w vm.max_map_count=262144
    
  4. Start Elasticsearch in the background:

    nohup ./elasticsearch &
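To confirm the node came up with the cluster name set in step 3, query its HTTP port (a sketch; replace the host with the machine IP configured in network.host):

```shell
# Expect a JSON document containing "cluster_name" : "ES641" on success.
ES_URL="http://localhost:9200"   # example; use http://<IP>:9200
curl -s "$ES_URL" | grep cluster_name || echo "Elasticsearch not reachable at $ES_URL"
```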
    

Install Gathr

Install and run in embedded mode

  1. Copy the Gathr tar file from the Gathr_bundle to the Virtual Machine.

  2. Extract the tar.

  3. Run this command to start Gathr in embedded mode:

    cd bin
    ./startServicesServer.sh -deployment.mode=embedded
    

    The user can check the log files for any issues during the installation process.

  4. To open Gathr, browse to http://<Public_IP>:8090/Gathr.

  5. Accept the End User License Agreement and click on the Next button.

    Picture17

    The Upload License page opens.

    Picture18

  6. Upload the license and confirm.

    Picture19

  7. The Login page is displayed.

    Picture20

    Follow the sections given below to switch Gathr from embedded to cluster mode.

    Login to Gathr using the default username & password.

    Picture21

    Navigate to the Setup » Gathr and update the below details:

    • Gathr Web URL

    • Zookeeper Gathr Node

    • Zookeeper Configuration Node

    Picture22

    Navigate to the Setup » Database and update the below details:

    • Connection URL

    • User

    • Password

    • Run Script

    Picture23

    Navigate to the Setup » Messaging Queue and update the below details:

    • Messaging Type

    • Host List

    • User

    • Password

    Picture24

    Navigate to the Setup » Elasticsearch and update the below details:

    • Elasticsearch Connection URL

    • Elasticsearch Cluster Name

    Picture25

Zookeeper Configuration

Update the Zookeeper properties in Gathr Configuration with the below-mentioned path:

<Gathr_install_dir>/conf/yaml/

Update Zookeeper property in the file env-config.yaml:

update_properties

Update the Zookeeper properties in Gathr Configuration with the below-mentioned path:

<Gathr\_install\_dir>/conf/

Update Zookeeper property in the file config.properties:

update_property

After updating the details, restart Gathr with -config.reload=true.

Picture26

Cloud Vendor War File

Copy Cloud Vendor specific war

Copy the cloud-vendor-specific war file into Tomcat.

For AWS

cp <Gathrinstallationlocation>/lib/emrservice.war <Gathrinstallationlocation>/server/tomcat/webapps/

For AWS-Databricks

cp <Gathrinstallationlocation>/lib/clusterMediator.war <Gathrinstallationlocation>/server/tomcat/webapps/

The war file will be extracted in server/tomcat/webapps. Now stop Tomcat and configure the application files.

For AWS

cd <Gathrinstallationlocation>/bin
./stopServicesServer.sh

Update the <Gathrinstallationlocation>/server/tomcat/webapps/emrservice/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME
spring.datasource.username=username
spring.datasource.password=password
spring.datasource.driver-class-name=org.postgresql.Driver

For AWS-Databricks

cd <Gathrinstallationlocation>/bin
./stopServicesServer.sh

Update the <Gathr installation location>/server/tomcat/webapps/cluster-mediator/WEB-INF/classes/application.properties file:

spring.datasource.url=jdbc:postgresql://<GathrPrivateIP>:5432/DBNAME
spring.datasource.username=username
spring.datasource.password=password
spring.datasource.driver-class-name=org.postgresql.Driver

Configure Cloud Vendor specific details in yaml


For AWS

Configure AWS details in YAML.

Open the env-config.yaml file (<GathrInstallationDir>/Gathr/conf/yaml/env-config.yaml) and append the content given below:

emr:
  instance.url: "http://<GathrPrivateIP>:8090/emrservice"
  s3.jar.uploadPath: "s3://sax-metadata"
  s3.log.uri: "s3://sax-metadata"
  isEnabled: "true"
  region: "us-west-2"

Copy the jar & init-scripts to S3:

aws s3 cp <Gathrinstallationlocation>/lib/spark-structured-sax-pipeline.jar s3://sax-metadata/
aws s3 cp $SAX_BUNDLE/init-scripts.sh s3://sax-metadata/

For AWS Databricks

Configure Databricks details in yaml.

File: <Gathrinstallationlocation>/Gathr/conf/yaml/env-config.yaml

Databricks:
  dbfs.jar.uploadPath: "/sax-databricks-jars"
  mediator.address: "http://<GathrPrivateIP>:8090/cluster-mediator/"
  isEnabled: "true"
  authToken: "<authtoken>"
  instanceUrl: "https://<databricks-instance-url>"

Copy jar & init-scripts on DBFS:

curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@<Gathr installation location>/lib/spark-structured-sax-pipeline.jar -F path="<sax metadata on dbfs path>/spark-structured-sax-pipeline.jar"

curl 'https://<databricks-instance>/api/2.0/dbfs/put' -H "Authorization: Bearer <personal-access-token-value>" -F contents=@$SAX_BUNDLE/init-scripts.sh -F path="<sax metadata on dbfs path>/init-scripts.sh"
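The uploads can be verified with the DBFS list API (a sketch; the placeholders match the upload commands above and must be substituted with real values before running):

```shell
# Generate a helper that lists the DBFS metadata path to confirm both files
# landed. Placeholders are intentionally left in; substitute real values.
cat > verify-dbfs.sh <<'EOF'
#!/bin/sh
curl -s 'https://<databricks-instance>/api/2.0/dbfs/list' \
    -H "Authorization: Bearer <personal-access-token-value>" \
    -G --data-urlencode 'path=<sax metadata on dbfs path>'
EOF
chmod +x verify-dbfs.sh
echo "verify-dbfs.sh written"
```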

Restart Gathr in Cluster Mode and upload license:

cd <Gathrinstallationlocation>/bin
./startServicesServer.sh -config.reload=true

Basic Sanity

After logging in with the default user (superuser), check the below steps:

  1. Validate Default connections.

  2. Validate Cluster List View and cluster creation.

    Picture42

  3. Validate the workspace and project creation.

  4. Associate token with the created user.

    Go to Manage Users and select the user with which the token needs to be associated.

    Picture43

    Click on edit icon.

    Picture44

    Click on the next button.

    Now, below the Databricks section, tick the Token Associated checkbox and select Use Existing.

    Enter the Databricks account email as the Username and the Token inside these boxes.

    Picture45

    After entering these details, click the Update button on the right.

    Picture46

    Log out and log in as the workspace user you created. The user should now also see the Cluster List View.

  5. Create a sample pipeline.

    Go to workspace -> project -> pipeline.

    Start a local session and configure a basic pipeline. (e.g. DG->RMQ).

    Save and exit.

  6. Configure job for the pipeline.

    Select either an existing cluster or a new cluster to run this pipeline.

    Picture47

    Picture48

    After the cluster launch, the pipeline will transition to the STARTING state and then the ACTIVE state.

    For logs, check the Databricks instance URL under Jobs:

    Picture49

    Select your pipeline name for checking any logs.

    After the pipeline stops, check the data at the emitter.

Steps to Restart Gathr

In case there are any updates to be done in the configurations, you can restart Gathr by providing the below commands:

./stopServicesServer.sh
./startServicesServer.sh -config.reload=true

Steps to Uninstall Gathr

  1. Stop/Kill the Bootstrap Process.

  2. Delete the Gathr installation directory and its dependencies (like RabbitMQ, Zookeeper, etc.).

  3. Delete the Gathr database.
