Overview: Databricks

This document walks through the steps to monitor a Databricks environment with Monte Carlo. The order of operations matters, and it is strongly recommended that you follow the documented sequence.

Please note the Table of Contents to the right for the full outline of steps.

1. Monte Carlo CLI

The Monte Carlo CLI allows you to enable the different integration points for Databricks and provides the most automated experience. Install it locally by following the Using the CLI documentation.
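
If you have Python and pip available, installation and setup typically look like the sketch below; the comments describe the usual prompts, but refer to the Using the CLI documentation for the authoritative steps.

pip install montecarlodata
montecarlo configure    # prompts for your Monte Carlo API key ID and secret
montecarlo --version    # confirm the CLI is installed and on your PATH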

2. Create a Personal Access Token or Service Principal

Creating a Personal Access Token is the simplest option for connecting to Databricks. For customers on Unity Catalog, Databricks recommends using a Service Principal for API access, but this requires the Databricks CLI in order to create a token.

Option 1: Creating a Personal Access Token

  1. In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
  2. On the Access tokens tab, click Generate new token.
  3. Enter a comment that helps you identify this token in the future (for example, monte-carlo-metadata-collection), and leave the Lifetime (days) box empty to create a token with no expiration. Click Generate.
  4. Copy/save the displayed Token, and then click Done.
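
To sanity-check the token before continuing, you can call a read-only Databricks REST endpoint with it; the snippet below is a minimal sketch, with <databricks-instance> and $DATABRICKS_TOKEN as placeholders for your workspace URL and the token you just saved.

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<databricks-instance>/api/2.0/clusters/list"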

Option 2: Creating a Service Principal (Unity Catalog only)

  1. As a Databricks account admin, log in to the Databricks Account Console, click User Management, and then select the Service Principals tab.
  2. Click Add service principal, enter a Name for the service principal, and click Add.
  3. Ensure that the Service Principal has the Allow cluster creation, Databricks SQL access, and Workspace access entitlements.
  4. Follow the Databricks documentation for creating a Service Principal Token (requires Databricks APIs) and save that Token.
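
If you create the token via the Databricks Token Management API rather than the CLI, the request looks roughly like the sketch below; the admin token, application ID, and lifetime are placeholders, and the exact workflow is described in the Databricks documentation referenced above.

# Mint a token on behalf of the service principal (requires a workspace admin token)
curl -s -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://<databricks-instance>/api/2.0/token-management/on-behalf-of/tokens" \
  -d '{"application_id": "<application-id>", "lifetime_seconds": 7776000, "comment": "monte-carlo-metadata-collection"}'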

3. Create Cluster for Metadata Collection

Adding the Metadata Connection allows Monte Carlo to gather metadata on a periodic basis. Monte Carlo supports metadata collection through All-purpose Clusters and Job Clusters. You must build the connection with an All-purpose Cluster (as specified below) and then optionally swap to a Job Cluster.

  1. Follow these steps to create an all-purpose cluster in your workspace. For environments with 10,000 tables or fewer, Monte Carlo recommends an i3.2xlarge node type. A Databricks runtime version with Spark >= 3.0 is required. The metadata operations executed do not use Worker nodes, so it is recommended to disable autoscaling and use only 1 Worker node. Please reach out to your account representative for help right-sizing!
  2. Navigate to the Cluster page, and copy/save the Cluster ID, which can be found at the end of the URL: https://<databricks-instance>/#/setting/clusters/<cluster-id>. For more details, see the Databricks documentation.
  3. Start the cluster.
  4. Confirm that this cluster has access to the catalogs, schemas, and tables that need to be monitored. To check this, you can run the following commands in a notebook attached to your new cluster. If all of the commands work and show the objects you expect, the cluster is configured correctly. If they do not, there may be an issue with the cluster's settings; ensure that the cluster is connecting to the correct metastore.
SHOW CATALOGS

SHOW SCHEMAS IN <CATALOG>

SHOW TABLES IN <CATALOG.SCHEMA>

DESCRIBE EXTENDED <CATALOG.SCHEMA.TABLE>
  5. Optional: Change to a Job Cluster
    Because only the metadata collection job runs through the metadata connection, you can switch the cluster used by the job to a Job Cluster. See Migrating the Databricks Metadata Job to a Job Cluster.

4. Create a SQL Warehouse or Cluster for a Query Engine

A SQL Warehouse [BETA] is recommended as the method for Monte Carlo to execute queries on Databricks. If you prefer to use an All-Purpose Cluster, please follow the Spark guide. See the FAQ section for why two connections are strongly recommended.

  1. Follow these steps to create a SQL Warehouse. The sizing of that warehouse depends on the type and number of monitors that you wish to use within Monte Carlo.
  2. Save the Warehouse ID.
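
If you want to double-check the Warehouse ID you saved, a minimal sketch using the SQL Warehouses API (with placeholder workspace URL, warehouse ID, and token) returns the warehouse's configuration and state:

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<databricks-instance>/api/2.0/sql/warehouses/<warehouse-id>"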

5. Add the Metadata Connection

This step uses the Monte Carlo UI to add the Metadata Connection.

🚧

This creates resources in your Databricks workspace.

This automates the creation of a secret, scope, directory, notebook and job to enable collection in your workspace. If you wish to create these resources manually instead, please reach out to your account representative.

  1. To add the Metadata Connection, navigate to the Integrations page in Monte Carlo. If this page is not visible to you, please reach out to your account representative.
  2. Under the Data Warehouse/Lake connections section, click the + button and select Databricks.
  3. Leave the Connection Type as Databricks Metastore.
  4. Under Warehouse Name, enter the name of the connection that you would like to see in Monte Carlo for this Databricks Workspace.
  5. Under Workspace URL, enter the full URL of your Workspace, e.g. https://${instance_id}.cloud.databricks.com. Be sure to include the https://.
  6. Under Workspace ID, enter the Workspace ID of your Databricks Workspace. If there is o= in your Databricks Workspace URL, for example https://<databricks-instance>/?o=6280049833385130, the number after o= is the Databricks Workspace ID; here the Workspace ID is 6280049833385130. If there is no o= in the deployment URL, the Workspace ID is 0.
  7. Under Cluster ID, enter the Cluster ID you saved in Step 3.2.
  8. Under Access Token, enter the Service Principal or Personal Access Token you created in Step 2.
  9. Click Create.

Optional: Check the Metadata Job
When the metadata connection is added, we will immediately start running the metadata job. Because of the way the job is constructed, we will try to gather metadata about all of the tables in the environment. Oftentimes, the permissions on the cluster prevent the metadata job from collecting information from certain tables. It's worthwhile to look at the job logs for the metadata job to see if there are any issues in collection.
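
The job logs are easiest to review in the Databricks Jobs UI. If you prefer the API, the output of a run can also be fetched with the Jobs API listed in the FAQ below; this is a sketch with a placeholder run ID taken from the run's page or URL.

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<databricks-instance>/api/2.1/jobs/runs/get-output?run_id=<run-id>"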

6. Add the Query Engine Connection

Adding the Query Engine for SQL Warehouse or an All-Purpose Spark Cluster requires the Monte Carlo CLI.

Option 1. SQL Warehouse

Use the montecarlo integrations add-databricks-sql-warehouse command to add the SQL Warehouse.

  1. Under databricks-workspace-url, enter the full URL of your Workspace, e.g. ${instance_id}.cloud.databricks.com. Do not include https://.
  2. Under databricks-warehouse-id, enter the Warehouse ID you saved in Step 4.2.
  3. Under databricks-token, enter the Service Principal or Personal Access Token you created in Step 2.
  4. If another warehouse connection already exists in Monte Carlo, under Name, use the same name you entered for Warehouse Name when adding the Metadata Connection in Step 5.4.
montecarlo integrations add-databricks-sql-warehouse --databricks-workspace-url dbc-12345678-abcd.cloud.databricks.com --databricks-warehouse-id 1234567890abcdef --databricks-token -1 --name data-lake

Option 2. All-Purpose Spark Cluster

Use the montecarlo integrations add-spark-databricks command for an All-Purpose Spark Cluster.
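
The flags are broadly analogous to the SQL Warehouse command above, with a cluster ID in place of a warehouse ID (an assumption worth verifying); you can list the exact options for your CLI version with:

montecarlo integrations add-spark-databricks --help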

Conclusion

You have connected all necessary integration points to get end-to-end observability for Databricks!

FAQ

Why does Monte Carlo require multiple connections to Databricks?
Because there are two distinct sets of operations (metadata collection and customized monitoring), there are two distinct sets of cluster requirements. Metadata collection generally requires a different cluster configuration than running customized queries, so we split the connections to allow each to connect to a cluster (or warehouse) tuned to its need.

Can I use the same Cluster for both connections?
While it is possible to use the same cluster for both the Databricks (metastore) and Spark connections, because of the differences in requirements we recommend using a separate cluster for each connection.

What Databricks APIs does Monte Carlo access?
When creating the metadata connection and running metadata collection, Monte Carlo needs access to certain Databricks APIs. The APIs needed on initial setup are:

  • /api/2.0/workspace/mkdirs
  • /api/2.0/workspace/import
  • /api/2.0/secrets/put
  • /api/2.0/secrets/scopes/create
  • /api/2.1/jobs/create

The APIs needed for continuous operation are:

  • /api/2.1/jobs/runs/get
  • /api/2.1/jobs/runs/get-output
  • /api/2.1/jobs/run-now
  • /api/2.0/clusters/get
  • /api/2.0/clusters/start
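
If you want to confirm that the token you created in Step 2 can reach these APIs, a quick spot check is to call one of the read-only endpoints; this sketch uses placeholder values for the workspace URL and the Cluster ID saved in Step 3.2.

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<databricks-instance>/api/2.0/clusters/get?cluster_id=<cluster-id>"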

Advanced Options

In general, using the Databricks (metastore) connection type will be sufficient. If your Databricks environment connects to an external metastore (Glue or Hive) and you wish to connect Monte Carlo directly to that metastore, Monte Carlo can still gather freshness and volume information on Delta tables in the Databricks environment via the Databricks Delta lake connection; however, that connection type is being deprecated. Ask your Monte Carlo representative for more details.