Databricks-Google Cloud - Compute Setup

Databricks on Google Cloud can be integrated with Gathr to provide users with a powerful platform to seamlessly design and deploy Gathr applications on a Databricks-Google Cloud compute environment.

Users can connect to multiple Databricks workspaces from Gathr projects and submit applications. With Gathr, users can read data from various sources, apply transformations, and ingest it into desired target platforms.

To begin, register your Databricks-Google Cloud account with Gathr by setting up a compute environment in a few simple steps. Once registered, you can run Gathr applications on your own compute environments.


Prerequisites to Registering Databricks-Google Cloud Account with Gathr

  • GCP account with the required permissions to create a Databricks workspace.

    For information on the permissions required to create and manage a Databricks workspace on Google Cloud, please click here.

  • Databricks workspace in your preferred GCP region.

    Follow this guide to create a Databricks workspace: Create Databricks Workspace.

  • Gathr requires either an access token or service principal credentials to access the Databricks workspace on GCP for submitting and running pipeline jobs.

    To learn how to create an access token, please click here.

  • After generating the access token, go to the Gathr Compute setup page and enter the Databricks-Google Cloud details associated with your workspace.

  • For authentication using a service principal, the service principal must be associated with the workspace.

  • Optionally, configure Unity Catalog on the workspace if needed. To set it up, please click here.
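Besides the workspace UI, a personal access token can also be created through the Databricks Token API. The sketch below is illustrative: the endpoint path (`POST /api/2.0/token/create`) is part of the public Databricks REST API, but the helper name and the default comment and lifetime are assumptions, not Gathr or Databricks code.

```python
import json

# Hedged sketch: builds (but does not send) a request for the Databricks
# Token API (POST /api/2.0/token/create). Helper name and defaults are
# illustrative assumptions; the endpoint is from the Databricks REST API.
def build_token_create_request(instance_url, comment="gathr-access",
                               lifetime_seconds=90 * 24 * 3600):
    url = instance_url.rstrip("/") + "/api/2.0/token/create"
    body = json.dumps({"comment": comment,
                       "lifetime_seconds": lifetime_seconds})
    return url, body
```

The returned URL and JSON body can be sent with any HTTP client while authenticated as a workspace user; the response contains the token value, which is shown only once and should be stored securely before entering it in Gathr.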

Ensure you have the following permissions in your Databricks workspace:

  • Databricks Workspace Permissions: Read-write permissions on the Databricks workspace to access necessary information.

  • Cluster Permissions: Create-edit permissions to allow Gathr to create and configure clusters for running applications.

  • Token Permissions: Create-manage token permissions in Databricks for Gathr to authenticate and interact with the Databricks environment.

  • DBFS Permissions: Read-write permissions on the Databricks File System (DBFS) for Gathr to store and access its binaries on DBFS.
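Each of the four permission areas above can be sanity-checked with a read-only call to the Databricks REST API before registering the account in Gathr. A minimal sketch follows, assuming the standard 2.0 endpoints; the helper name and the choice of probe endpoints are our own, not part of Gathr.

```python
# Hedged sketch: maps each permission area listed above to a read-only
# Databricks REST API endpooint URL that can be probed with the access
# token. Endpoint paths are from the public Databricks REST API; the
# helper name and chosen probes are illustrative assumptions.
def permission_probe_endpoints(instance_url, dbfs_path="/FileStore"):
    base = instance_url.rstrip("/")
    return {
        "workspace": base + "/api/2.0/workspace/get-status?path=/",
        "clusters":  base + "/api/2.0/clusters/list",
        "tokens":    base + "/api/2.0/token/list",
        "dbfs":      base + "/api/2.0/dbfs/get-status?path=" + dbfs_path,
    }
```

Each URL can be issued as a GET request with an `Authorization: Bearer <access-token>` header; a 403 response on any probe points to the corresponding missing permission.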


Steps to Add Databricks-Google Cloud Account

Follow these steps to register a Databricks-Google Cloud account in Gathr:

  1. Log in to Gathr and click User Profile (in the left pane).

    gathr-landing-page

  2. On the User Profile page, switch to the Compute Setup tab and click the ADD CLOUD SERVICE ACCOUNT option.

    Compute_Setup_Home

  3. Choose the Account Type as Databricks GCP.

    Select-Google-Cloud-Databricks-Account

  4. Provide your Databricks account details, referring to the field descriptions given below:

    • Account Name:

      Provide a user-friendly alias for the Databricks account (for example, Dev Account). This alias will be used for easy identification within Gathr.

    • Instance URL: On the Databricks workspace management page, look for the “Workspace URL” or “Instance URL” under the workspace details. The URL is typically listed in the format https://<databricks-instance>.databricks.com, where <databricks-instance> is a unique identifier for your instance.

      To learn more about the Workspace URLs, click here.

    • Authentication Type:

      Choose the authentication method that Gathr applications use when connecting to Databricks.

      • Token: Use this option to authenticate using a personal access token.

        Access Token: Paste the access token generated in your Databricks workspace. This token is required for Gathr to authenticate and interact with the Databricks environment.

        To learn more about creating a Databricks personal access token for your workspace user, click here.

      • Service Principal: Use this option to authenticate with a Service Principal.

        Client ID: Provide the unique identifier assigned to the Service Principal.

        Client Secret: Provide the client secret generated for the Service Principal.

        To learn how to obtain the Client ID and Client Secret from your Databricks account, refer to these steps.

    • DBFS Gathr Metadata Repository Path: Specify the path on the Databricks File System (DBFS) where you want to store the Gathr binaries; this path is also used as the metadata root folder for Gathr.

      The path should begin with “/”.

      Example: /FileStore/gathrBinaries. Ensure that the provided path is accessible and has the necessary permissions.

    • Access Mode: Access mode is a security feature that determines who can use the compute and what data they can access via the compute.

      The options are:

      • None: No Access Mode is applied.

      • Single user: Run SQL, Python, R, and Scala workloads as a single user, with access to data secured in Unity Catalog. When this mode is enabled, only one user is allowed to run commands on the cluster, and that user must have ‘Can Attach To’ permission.

      If Authentication Type is set to Token, provide the User Name.

      You can find the User Name listed in the User Settings or Account Settings page. It is typically the same as the username you used to log in.

      If Authentication Type is set to Service Principal, provide the Service Principal Application ID.

      You can locate the Service Principal Application ID in your Databricks account under Settings > Identity and Access > Service Principals.

      Please refer to the Single User Access Mode limitations on Unity Catalog here.

  5. Click the SAVE option. The Databricks account details will be added, followed by a success message.
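The field constraints described in step 4 can be summarized in a small validation sketch. This is an illustration under stated assumptions, not Gathr's actual validation logic: the helper name and error messages are ours, while the Bearer-token header and the OAuth client-credentials token endpoint (`/oidc/v1/token`) follow the public Databricks authentication documentation.

```python
# Illustrative sketch (not Gathr code): checks the compute-setup fields
# and returns the authentication material Gathr would need for each
# Authentication Type. Helper name and messages are assumptions.
def validate_account_details(instance_url, dbfs_path, auth_type, **creds):
    if not instance_url.startswith("https://"):
        raise ValueError("Instance URL must start with https://")
    if not dbfs_path.startswith("/"):
        raise ValueError("DBFS path should begin with '/'")
    if auth_type == "Token":
        # Personal access tokens are sent as a Bearer header.
        return {"Authorization": "Bearer " + creds["access_token"]}
    if auth_type == "Service Principal":
        # OAuth M2M: the client id/secret are exchanged for a short-lived
        # access token at the workspace's /oidc/v1/token endpoint.
        return {"token_url": instance_url.rstrip("/") + "/oidc/v1/token",
                "client_id": creds["client_id"],
                "client_secret": creds["client_secret"],
                "grant_type": "client_credentials"}
    raise ValueError("Unknown authentication type: " + auth_type)
```

For example, a Token-type account with DBFS path /FileStore/gathrBinaries yields a ready-to-use Bearer header, while a malformed DBFS path (one not beginning with "/") is rejected before any network call is made.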


Associate Registered Databricks-Google Cloud Account to Project

Once the Databricks-Google Cloud account is registered, it must be associated with Gathr projects.

This is a mandatory step; only after it is completed will project users be able to submit jobs on the registered Databricks compute environment’s workspace.

  • Only an individually signed-up user or an Organization Administrator can associate registered Databricks accounts with projects in Gathr.

Steps to Associate a Databricks Account to a Project:

  1. Navigate to the Projects page.

  2. Edit the project with which the registered Databricks compute environment should be associated.

  3. Update the required fields as described below:

    • Project Name: The name of the project cannot be edited.

    • Account Type: The cloud account mapped at the time of project creation will be shown. For example, Databricks-GCP.

    • Account: Only organization administrators can edit this field. The Databricks-Google Cloud accounts selected in this field will be available for submitting jobs to users in this project.

      Select Google Databricks Accounts

      Gathr Admin Account is selected by default, signifying that Gathr Compute will always remain an option for submitting jobs. Select the registered Databricks accounts that you want to have as deployment options.

    • Tags: (Optional) Tags can be added or removed for the project.

    • Description: (Optional) Description can be added or updated for the project.
  4. Save the updated project details.


Compute Setup Listing Page

Databricks-Google Cloud accounts, once registered with Gathr, are shown on the Compute Setup listing page.

GCP_Databricks_CC_Listing_Page

The details that can be seen on the compute setup listing page for Databricks-Google Cloud account are:

  • Account Name

  • Instance URL

  • DBFS Path

  • Actions


Register Multiple Databricks-Google Cloud Accounts

Multiple Databricks-Google Cloud accounts can be added using the Add New Account button:

Add_GCP_Databricks_Account

The remaining steps are the same as described earlier in this topic.


Switch from Gathr Engine to GCP Databricks Engine

Gathr leverages the Databricks Interactive Cluster to streamline interactive development of ETL, Ingestion, and CDC applications.

The Gathr engine is your default engine when you log in to Gathr. To take advantage of the GCP Databricks interactive cluster, you need to switch to the Databricks engine.

The engines are shown above the User Settings option.

gathr-engines

Hover over the engine symbol to view the available engine options.

For more details, see how to Associate Registered Databricks-Google Cloud Account to Project.


Switching Between Engines

After registering a Databricks-Google Cloud account and associating it with a project, users gain the option to switch between the Gathr and Databricks engines.

Refer to the topic, Steps to Add Databricks-Google Cloud Account for step-by-step guidance on registering a Databricks account with Gathr.

Connecting to Databricks Engine

Click on the preferred engine option to make the switch.

switch-engine

  • To connect to the Databricks engine, users need to select the registered Databricks-Google Cloud account from the options available.

    switch-to-databricks-engine-popup

  • After selecting the desired Databricks-Google Cloud account from the drop-down, click SWITCH.

  • The connection process may take as long as the cluster’s initialization time.

Displaying GCP Databricks Engine Details:

  • Once connected, the GCP Databricks engine details, including cluster name, ID, and uptime, are prominently displayed for user reference.

    switch-to-gcp-databricks-engine


Leverage GCP Databricks Job Clusters to Submit Gathr Applications

Applications you create in Gathr can run either on a Gathr cluster or a GCP Databricks job cluster. This can be configured in two ways:

  • While saving the application’s deployment preferences on the Pipeline Definition page, you can select the registered Databricks account on which the job will run.

    Pipeline-Definition-Databricks-Deployment

  • Alternatively, after saving the application, the same configuration can be done from the listing page using the Change Cluster option.

    Listing-Databricks-Deployment
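Under the hood, running an application on a Databricks job cluster corresponds to a one-time run submission against the Databricks Jobs API (`POST /api/2.1/jobs/runs/submit`). The sketch below builds such a payload; the helper name, Spark version, node type, and jar path are illustrative assumptions for a GCP workspace, and Gathr's actual submission internals may differ.

```python
import json

# Hedged sketch: builds a one-time run payload for the Databricks Jobs
# API (POST /api/2.1/jobs/runs/submit) that runs a JAR on a new job
# cluster. Spark version, node type, and jar path are assumptions.
def build_job_cluster_run(run_name, jar_dbfs_path, main_class,
                          spark_version="13.3.x-scala2.12",
                          node_type_id="n2-highmem-4", num_workers=2):
    return json.dumps({
        "run_name": run_name,
        "tasks": [{
            "task_key": "gathr_app",
            "new_cluster": {
                "spark_version": spark_version,
                "node_type_id": node_type_id,   # a GCP machine type
                "num_workers": num_workers,
            },
            "spark_jar_task": {"main_class_name": main_class},
            "libraries": [{"jar": jar_dbfs_path}],
        }],
    })
```

Because the cluster is created per run and terminated afterward, job clusters suit scheduled or batch Gathr applications, while the interactive cluster discussed earlier suits iterative development.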
