Databricks setup
This document walks through the steps to monitor a Databricks environment with Monte Carlo. The order of operations is important and it is strongly recommended to adhere to the documented sequence. These steps need to be repeated for each Databricks Workspace that you would like to observe with Monte Carlo.
Please note the Table of Contents to the right for the full outline of steps.
1. Create a Personal Access Token or Service Principal
Creating a Personal Access Token is the simplest option for connecting to Databricks. For customers on Unity Catalog, Databricks recommends using a Service Principal for API access, but that requires the Databricks CLI in order to create a Token.
Option 1: Creating a Personal Access Token
- You must be an Admin in the Databricks Workspace (admin access is required to generate the resources listed in step 4).
- In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
- On the Access tokens tab, click Generate new token.
- Enter a comment (monte-carlo-metadata-collection) that helps you identify this token in the future, and create a token with no lifetime by leaving the Lifetime (days) box empty. Click Generate.
- Copy/save the displayed Token, and then click Done.
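If you prefer to script this step, the sketch below is one possible way to create the same token with the Databricks Tokens API (POST /api/2.0/token/create). The workspace URL and the existing admin token used for authentication are placeholders, not values from this guide.

import requests

# Placeholders: replace with your workspace URL and an existing admin token.
WORKSPACE_URL = "https://<databricks-instance>"
ADMIN_TOKEN = "<existing-admin-personal-access-token>"

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    # Omitting lifetime_seconds mirrors leaving the Lifetime (days) box empty.
    json={"comment": "monte-carlo-metadata-collection"},
)
resp.raise_for_status()
print(resp.json()["token_value"])  # copy/save this token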
Option 2: Creating a Service Principal
This option is only available if you are using Unity Catalog, as Service Principals are a Unity Catalog feature.
- As a Databricks account admin, log in to the Databricks Account Console, click User Management, and then select the Service Principals tab.
- Click Add service principal, enter a Name for the service principal, and click Add.
- Ensure that the Service Principal has Allow cluster creation, Databricks SQL access, and Workspace access Entitlements.
- Follow the Databricks documentation for creating a Service Principal Token (requires Databricks APIs) and save that Token.
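For reference, one way to create such a token programmatically (an assumption on our part, not the only supported path) is the Token Management API's on-behalf-of endpoint, which lets a workspace admin mint a token for a Service Principal. The application ID, admin token, and lifetime below are placeholders.

import requests

# Placeholders: adjust to your workspace, admin token, and Service Principal.
WORKSPACE_URL = "https://<databricks-instance>"
ADMIN_TOKEN = "<workspace-admin-token>"
SP_APPLICATION_ID = "<service-principal-application-id>"

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token-management/on-behalf-of/tokens",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "application_id": SP_APPLICATION_ID,
        "comment": "monte-carlo-metadata-collection",
        "lifetime_seconds": 7776000,  # example: 90 days; adjust to your token policy
    },
)
resp.raise_for_status()
print(resp.json()["token_value"])  # copy/save this token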
2. Create Cluster for Metadata Collection
Adding the Metadata Connection allows Monte Carlo to gather metadata on a periodic basis. Monte Carlo supports metadata collection through All-purpose Clusters and Job Clusters. You must build the connection with an All-purpose Cluster (as specified below) and then optionally swap to a Job Cluster.
- Follow these steps to create an all-purpose cluster in your workspace.
a. For environments with 10,000 tables or fewer, Monte Carlo recommends using an i3.2xlarge node type.
b. A Databricks runtime version with Spark >= 3.0 is required.
c. The metadata operations executed do not use Worker nodes, so it is recommended to disable autoscaling and use only 1 Worker node. Please reach out to your account representative for help right-sizing!
d. Please add spark.databricks.isv.product MonteCarlo+ObservabilityPlatform to the cluster config. A minimal sketch of a matching cluster definition appears at the end of this step.
- Navigate to the Cluster page, and copy/save the Cluster ID, which can be found at the end of the URL: https://<databricks-instance>/#/setting/clusters/<cluster-id>. For more details, see the Databricks documentation.
- Start the cluster.
- Confirm that this cluster has access to the catalogs, schemas, and tables that need to be monitored. To check this, you can run the following commands in a notebook attached to your new cluster. If all of the commands work and show the objects you expect, the cluster is configured correctly. If they do not, there may be an issue with the cluster's settings; ensure that the cluster is connecting to the correct metastore.
SHOW CATALOGS
SHOW SCHEMAS IN <CATALOG>
SHOW TABLES IN <CATALOG.SCHEMA>
DESCRIBE EXTENDED <CATALOG.SCHEMA.TABLE>
- Optional: Change to a Job Cluster
Because the metadata connection only runs the metadata collection job, you can switch the cluster used for that job to a Job Cluster. See Migrating the Databricks Metadata Job to a Job Cluster.
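As an illustration of the sizing guidance above (the sketch referenced in step d), the following creates an all-purpose cluster through the Clusters API with a single worker, no autoscaling, and the Monte Carlo ISV spark config. The runtime version, node type, auto-termination value, and token are assumptions to adjust to your environment.

import requests

WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
TOKEN = "<token-from-step-1>"                    # placeholder

cluster_spec = {
    "cluster_name": "monte-carlo-metadata-collection",
    "spark_version": "13.3.x-scala2.12",   # assumption: any runtime with Spark >= 3.0
    "node_type_id": "i3.2xlarge",          # recommendation for <= 10,000 tables
    "num_workers": 1,                      # fixed size, autoscaling disabled
    "autotermination_minutes": 60,         # assumption; tune to your needs
    "spark_conf": {
        "spark.databricks.isv.product": "MonteCarlo+ObservabilityPlatform",
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # save this Cluster ID for step 4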
3. Create a SQL Warehouse or Cluster for a Query Engine
A SQL Warehouse [BETA] is recommended as the method for Monte Carlo to execute queries on Databricks because it offers better cost and availability and is quicker to spin up for running queries. If you prefer to use an All-Purpose Cluster, or do not have access to SQL Warehouses in Databricks, please follow the Spark guide.
- Follow these steps to create a SQL Warehouse. The sizing of that warehouse depends on the type and number of monitors that you wish to use within Monte Carlo.
- Save the Warehouse ID.
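If you would rather script the warehouse creation, a minimal sketch against the SQL Warehouses API is below. The warehouse name, size, and auto-stop values are placeholders to adapt to your monitor volume.

import requests

WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
TOKEN = "<token-from-step-1>"                    # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "monte-carlo-query-engine",  # hypothetical name
        "cluster_size": "Small",             # assumption; size to your monitors
        "auto_stop_mins": 10,
    },
)
resp.raise_for_status()
print(resp.json()["id"])  # save this Warehouse ID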
4. Add the Connections in Monte Carlo
This step uses the Monte Carlo UI to add the Connections. Please ensure the Clusters and Warehouse are turned on in order to add the connections.
This creates resources in your Databricks workspace: it automates the creation of a secret, secret scope, directory, notebook, and job to enable collection. If you wish to create these resources manually instead, please reach out to your account representative.
- To add the Connections, navigate to the Integrations page in Monte Carlo. If this page is not visible to you, please reach out to your account representative.
- Under the Data Lake and Warehouses section, click the Create button and select Databricks.
- Use the Create new integration button (or Add to Existing if applicable)
- Under Warehouse Name, enter the name of the connection that you would like to see in Monte Carlo for this Databricks Workspace.
- Under Workspace URL, enter the full URL of your Workspace, e.g. https://${instance_id}.cloud.databricks.com. Be sure to include the https:// prefix.
- Under Workspace ID, enter the Workspace ID of your Databricks Workspace. If there is o= in your Databricks Workspace URL, for example https://<databricks-instance>/?o=6280049833385130, the number after o= is the Databricks Workspace ID; here the Workspace ID is 6280049833385130. If there is no o= in the deployment URL, the Workspace ID is 0. A small parsing sketch appears after this list.
- Under Personal Access Token or Service Principal Token, enter the Service Principal Token or Personal Access Token you created in Step 1.
- For Metadata Collection Jobs, under Cluster ID, enter the Cluster ID you saved in Step 2.2.
- Under Query Engine, select the integration type that matches what you set up in Step 3.
- Enter the SQL Warehouse ID or Cluster ID.
- Click Create and validate that the connection was created successfully.
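To illustrate the Workspace ID rule referenced above, here is a small, self-contained sketch that extracts the ID from a Workspace URL and falls back to 0 when there is no o= parameter.

from urllib.parse import urlparse, parse_qs

def workspace_id_from_url(url: str) -> str:
    # If the URL carries an o= query parameter, that value is the Workspace ID;
    # otherwise the Workspace ID is 0.
    params = parse_qs(urlparse(url).query)
    return params["o"][0] if "o" in params else "0"

print(workspace_id_from_url("https://<databricks-instance>/?o=6280049833385130"))  # 6280049833385130
print(workspace_id_from_url("https://<databricks-instance>/"))                     # 0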
Recommended: Check the Metadata Job
When the metadata connection is added, Monte Carlo will immediately start running the metadata job. Because of the way the job is constructed, it will try to gather metadata about all of the tables in the environment. Oftentimes, the permissions on the cluster prevent the metadata job from collecting information from certain tables, so it's worthwhile to look at the job logs for the metadata job to see if there are any issues in collection.
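One way to spot collection problems without clicking through the UI is to list the most recent runs of the metadata job via the Jobs API and look at their result states. This is a sketch only; the job ID is a placeholder for the job the integration created in your workspace.

import requests

WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
TOKEN = "<token-from-step-1>"                    # placeholder
JOB_ID = 123456789                               # placeholder: the metadata collection job

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": JOB_ID, "limit": 5},
)
resp.raise_for_status()
for run in resp.json().get("runs", []):
    state = run.get("state", {})
    # result_state is SUCCESS/FAILED/etc.; state_message often explains permission issues.
    print(run["run_id"], state.get("result_state"), state.get("state_message"))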
Conclusion
You have connected all necessary integration points to get end-to-end observability for Databricks!
FAQ
Why does Monte Carlo require multiple connections to Databricks?
Because we have two distinct sets of operations (metadata collection and customized monitoring), we have two distinct cluster sizing requirements. Metadata collection generally requires a different cluster configuration than running customized queries, so we split the connections to allow connecting to different clusters (or warehouses) tuned to each need.
Can I use the same Cluster for both connections?
While it is possible to use the same cluster for both the Databricks (metastore) and Spark connections, because of the differences in requirements we recommend using a separate cluster for each connection.
What Databricks APIs does Monte Carlo access?
When creating the metadata connection and running metadata collection, Monte Carlo needs access to certain Databricks APIs. The APIs needed for initial setup are:
- /api/2.0/workspace/mkdirs
- /api/2.0/workspace/import
- /api/2.0/secrets/put
- /api/2.0/secrets/scopes/create
- /api/2.1/jobs/create
The APIs needed for continuous operation are:
- /api/2.1/jobs/runs/get
- /api/2.1/jobs/runs/get-output
- /api/2.1/jobs/run-now
- /api/2.0/clusters/get
- /api/2.0/clusters/start
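If you want to confirm that the token you supplied can reach the continuous-operation endpoints, a quick read-only check like the sketch below (using the clusters/get call from the list above; all values are placeholders) is usually enough.

import requests

WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
TOKEN = "<token-used-in-the-integration>"        # placeholder
CLUSTER_ID = "<metadata-collection-cluster-id>"  # placeholder

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)
# 200 means the token can read the cluster; 403 usually points to a permissions gap.
print(resp.status_code, resp.json().get("state"))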
Advanced Options
In general, the Databricks (metastore) connection type will be sufficient. If your Databricks environment connects to an external metastore (Glue or Hive) and you wish to connect Monte Carlo directly to that metastore, Monte Carlo can still gather freshness and volume information on Delta tables in the Databricks environment through the Databricks Delta Lake connection; however, that connection is being deprecated. Ask your Monte Carlo representative for more details.