Create a Databricks Data Source

Follow these steps to connect to Databricks and create a Databricks data source:

  1. Configure User Access Rights

  2. Create a Databricks Data Source

Refer to Additional Information for performance tips and details about Databricks feature support.

Configure User Access Rights

We recommend that you create a dedicated user and user role for integration with the GoodData platform. The user must have read access to the catalog and schema that you want GoodData to query; in Unity Catalog, this typically means the USE CATALOG, USE SCHEMA, and SELECT privileges.
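If you use Unity Catalog, read access comes down to a few GRANT statements. The following is a minimal sketch that issues one of them through the Databricks SQL Statement Execution API; the host, warehouse ID, catalog, schema, and principal are hypothetical placeholders, and you can equally run the statements in the Databricks SQL editor:

    # Hypothetical sketch: grant read access to a dedicated integration user.
    # GRANT USE CATALOG ON CATALOG main ... and
    # GRANT USE SCHEMA ON SCHEMA main.analytics ... follow the same pattern.
    curl "https://<databricks-host>/api/2.0/sql/statements" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{
        "warehouse_id": "<warehouse_id>",
        "statement": "GRANT SELECT ON SCHEMA main.analytics TO `gooddata-integration`"
      }'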

Create a Databricks Data Source

Once you have configured your Databricks user’s access rights, you can proceed to create a Databricks data source that you can then connect to.

UI
API

Steps:

  1. On the home page, switch to Data sources.

  2. Click Connect data.

  3. Select Databricks.

  4. Name your data source, fill in your Databricks credentials, and click Connect.

  5. Enter your schema name and click Save.


    Your data source is created!


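The API calls in the steps below assume that your GoodData host and API token are exported as environment variables, for example:

    # Replace with your GoodData instance URL and a valid API token.
    export HOST_URL="https://<your-gooddata-instance>"
    export API_TOKEN="<your_api_token>"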
Steps:

  1. Create a Databricks data source with the following API call, providing only the credential attributes that match your authentication method (see Data Source Details below):

    curl $HOST_URL/api/v1/entities/dataSources \
      -H "Content-Type: application/vnd.gooddata.api+json" \
      -H "Accept: application/vnd.gooddata.api+json" \
      -H "Authorization: Bearer $API_TOKEN" \
      -X POST \
      -d '{
        "data": {
          "type": "dataSource",
          "id": "<unique_id_for_the_data_source>",
          "attributes": {
            "name": "<data_source_display_name>",
            "type": "DATABRICKS",
            "url": "<DATABRICKS_JDBC_URL>",
            "token": "<DATABRICKS_PERSONAL_ACCESS_TOKEN>",
            "username": "<DATABRICKS_USERNAME>",
            "password": "<DATABRICKS_PASSWORD>",
            "schema": "<DATABRICKS_SCHEMA>",
            "parameters": [
              { "name": "catalog",
                "value": "<DATABRICKS_CATALOG>"
              }
            ]
          }
        }
      }' | jq .
    
  2. To confirm that the data source has been created, check that the server returns a response similar to the following:

    {
      "data": {
        "attributes": {
          "name": "<data_source_display_name>",
          "type": "DATABRICKS",
          "url": "<DATABRICKS_JDBC_URL>",
          "username": "<DATABRICKS_USERNAME>",
          "schema": "<DATABRICKS_SCHEMA>",
          "parameters": [
            { "name": "catalog",
              "value": "<DATABRICKS_CATALOG>"
            }
          ]
        },
        "id": "databricks-datasource",
        "type": "dataSource"
      }
    }
    
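To inspect the stored attributes later, you can read the entity back. A minimal sketch, assuming the entities endpoint also supports GET by ID:

    # Fetch the data source you just created; as in the response above,
    # secrets such as token and password are not echoed back.
    curl "$HOST_URL/api/v1/entities/dataSources/<unique_id_for_the_data_source>" \
      -H "Accept: application/vnd.gooddata.api+json" \
      -H "Authorization: Bearer $API_TOKEN" \
      -X GET | jq .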

Additional Information

Ensure you understand the following limitations and recommended practices.

Data Source Details

  • GoodData uses up-to-date JDBC drivers.

  • The JDBC URL must be in the following format (for a complete example with parameters, see Supported URL Parameters below):

    jdbc:databricks://<host>:<port>

  • You can use OAuth M2M authentication or a personal access token. Basic authentication is not supported.

  • The following Databricks Runtime versions are supported:

    • 12.x (Apache Spark 3.3.2, Scala 2.12)
    • 11.x (Apache Spark 3.3.0, Scala 2.12)
    • 10.4 (Apache Spark 3.2.1, Scala 2.12)

Unsupported Features

GoodData does not support the following features:

  • REGR_INTERCEPT function (versions 10.4 and 11.1)
  • REGR_SLOPE function (version 10.4)
  • REGR_R2 function (version 10.4)
  • Referential integrity for non-Unity Catalog:
    • Non-Unity Catalog does not support referential integrity (primary and foreign keys).
    • Primary and foreign keys therefore cannot be utilized when generating a logical data model (LDM). You have to set referential integrity yourself in the LDM Modeler. For Unity Catalog, see the sketch after this list.
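With Unity Catalog, you can declare primary and foreign keys as informational constraints so that they are available when the LDM is generated. Below is a hedged sketch using the Databricks SQL Statement Execution API; the host, warehouse ID, table, and constraint names are hypothetical, and you can equally run the statement in the Databricks SQL editor:

    # Hypothetical: declare an informational primary key on a Unity Catalog
    # table. The key column must be NOT NULL first, and a
    # FOREIGN KEY ... REFERENCES constraint follows the same pattern.
    curl "https://<databricks-host>/api/2.0/sql/statements" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{
        "warehouse_id": "<warehouse_id>",
        "statement": "ALTER TABLE main.analytics.orders ADD CONSTRAINT orders_pk PRIMARY KEY (order_id)"
      }'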

Performance Tips

If your database holds a large amount of data, consider the following practices:

  • Denormalize the relational data model of your database. This helps avoid large JOIN operations. Because Databricks stores data in a columnar format, queries read only the required columns, and each column is compressed separately.

  • Cluster the tables on the columns that are most frequently used for JOIN and aggregation operations; Databricks has no traditional indexes, so use clustering or Z-ordering instead (see the sketch after this list). Those columns are typically mapped to attributes, labels, and primary and foreign keys.
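Databricks does not use traditional secondary indexes; the closest equivalent is to co-locate data on frequently joined columns, for example with Delta Z-ordering. A minimal sketch, again via the Statement Execution API, with a hypothetical table and column:

    # Hypothetical: cluster the table on a frequently joined column so that
    # queries scan fewer files.
    curl "https://<databricks-host>/api/2.0/sql/statements" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{
        "warehouse_id": "<warehouse_id>",
        "statement": "OPTIMIZE main.analytics.orders ZORDER BY (customer_id)"
      }'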

Query Timeout

Query timeout for a Databricks data source is currently not supported due to a limitation in the Databricks API.

Supported URL Parameters

  • transportMode
  • ssl
  • AuthMech
  • httpPath
  • UID
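
As an illustration, a complete URL that combines these parameters might look like the following; the host and warehouse ID are placeholders, and AuthMech=3 with UID=token corresponds to personal access token authentication in the Databricks JDBC driver:

    jdbc:databricks://<host>:443;transportMode=http;ssl=1;AuthMech=3;UID=token;httpPath=/sql/1.0/warehouses/<warehouse_id>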