Databricks

What is Databricks?

Databricks is a platform for building, deploying, sharing, and maintaining data and analytics solutions. It provides tools that help you connect your sources of data to one platform to process, store, share, analyze, and model datasets. The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf.

Why connect Databricks to Monte Carlo?

Connecting your Databricks account to Monte Carlo and providing access to the data within is essential for Monte Carlo to function. Monte Carlo collects metadata from tables in your Databricks environment to enable our automated Freshness and Volume Monitors.

Monte Carlo is a Data Observability tool after all, so the biggest step in getting value from it is letting it observe your data!

What integrations does Monte Carlo currently support?

Monte Carlo currently supports integration with all three Databricks platforms: AWS, GCP, and Azure.
For Delta tables, automated Freshness and Volume Monitors are available. You can opt in to field health, dimension, custom SQL, and SLI monitors on all three platforms. Please note that by default volume monitors are measured in bytes, but customers can opt in to row count monitoring for any table.

Table lineage can also be tracked for Databricks users with Unity Catalog.

Freshness and Volume:
Our freshness and volume monitors are active out of the box. By collecting metadata on your tables, Monte Carlo’s machine learning algorithm can set automatic thresholds to alert when a table has gone too long without an update, or when an abnormal number of rows is added or deleted.

This document walks through the steps to monitor a Databricks environment with Monte Carlo. The order of operations is important, and it is strongly recommended to adhere to the documented sequence. These steps need to be repeated for each Databricks Workspace that you would like to observe with Monte Carlo.

Please note the Table of Contents to the right for the full outline of steps.

0. Pre-requisites

If you are an existing Monte Carlo customer with a self-hosted Data Collector, ensure your Data Collector is up to date before proceeding. You can check your Data Collector's last updated time under Settings > Integrations > Collectors.

πŸ“˜

Databricks Partner Connect

You can now connect to Monte Carlo directly through Databricks Partner Connect! This drastically simplifies the integration process and is our recommended approach. To use Databricks Partner Connect, simply complete Step 3 (Create a SQL Warehouse for Metadata Collection) and proceed to Partner Connect to connect to Monte Carlo. Return to this document for Step 6 if using Unity Catalog.

1. Create a Personal Access Token or Service Principal

Creating a Personal Access Token is the simplest way to connect to Databricks. However, Databricks recommends using a Service Principal for API access, which requires the Databricks CLI or API to create a token.

Option 1: Creating a Service Principal (Recommended)

This option is only available if you are using Unity Catalog, as Service Principals are a Unity Catalog feature.

  1. As a Databricks account admin, log in to the Databricks Account Console, click User Management, and then select the Service Principals tab.
  2. Click Add service principal, enter a Name for the service principal, and click Add.
  3. Ensure that the Service Principal has Databricks SQL access and Workspace access Entitlements.
  4. Follow the Databricks documentation for creating a Service Principal Token (requires Databricks APIs) and save that Token.
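As an illustrative sketch only (not Monte Carlo tooling), step 4 can be done against the Databricks Token Management API, which issues tokens on behalf of a service principal. The host, application ID, and admin token below are placeholders; the request is built but deliberately not sent:

```python
import json
import urllib.request

def build_sp_token_request(host, application_id, admin_token, lifetime_seconds=None):
    """Build (without sending) a request asking the Databricks Token
    Management API for a token on behalf of a service principal."""
    url = f"https://{host}/api/2.0/token-management/on-behalf-of/tokens"
    payload = {
        "application_id": application_id,  # the service principal's application ID
        "comment": "monte-carlo-metadata-collection",
    }
    if lifetime_seconds is not None:
        payload["lifetime_seconds"] = lifetime_seconds  # omit for a non-expiring token
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending the request (urllib.request.urlopen) requires a workspace admin
# token and network access, so it is left out of this sketch.
```

Save the `token_value` from the API response; it is shown only once.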

Option 2: Creating a Personal Access Token (Not recommended)

  1. You must be an Admin in the Databricks Workspace (admin access is required to generate the resources listed in step 4).
  2. In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
  3. On the Access tokens tab, click Generate new token.
  4. Enter a comment (monte-carlo-metadata-collection) that helps you to identify this token in the future and create a token with no lifetime by leaving the Lifetime (days) box empty. Click Generate.
  5. Copy/save the displayed Token, and then click Done.

2. Grant permissions

Configure permissions depending on the type of metastore. If the workspace has catalogs both in Unity Catalog and outside of it, both need to be configured. If you want to restrict access to specific schemas, see Setting up dataset/schema controls.

Unity Catalog

Grant permissions to the Service Principal to read the Catalogs. This will cascade to all schemas within:

GRANT USE CATALOG ON CATALOG <CATALOG> TO <monte_carlo_service_principal>;
GRANT USE SCHEMA ON CATALOG <CATALOG> TO <monte_carlo_service_principal>;
GRANT SELECT ON CATALOG <CATALOG> TO <monte_carlo_service_principal>;

For details, see Databricks documentation.

See Troubleshooting & FAQ: Databricks if the command returns a "Privilege SELECT is not applicable" error.
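If you have many catalogs to cover, the three GRANT statements above can be generated per catalog. A minimal sketch; the catalog names and principal below are examples, and the backticks around the principal are a common Databricks quoting convention:

```python
# Privileges from the Unity Catalog grants above, in the same order.
PRIVILEGES = ["USE CATALOG", "USE SCHEMA", "SELECT"]

def unity_catalog_grants(catalogs, principal):
    """Generate the Unity Catalog GRANT statements for each catalog."""
    return [
        f"GRANT {privilege} ON CATALOG {catalog} TO `{principal}`;"
        for catalog in catalogs
        for privilege in PRIVILEGES
    ]

for stmt in unity_catalog_grants(["main", "analytics"], "monte_carlo_sp"):
    print(stmt)
```

Run the emitted statements in the Databricks SQL editor as a user with grant rights on each catalog.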

Lineage

Monte Carlo uses system tables to collect and present lineage.

  1. Follow the Databricks documentation to enable the system tables within Databricks. Execute the instructions for each workspace that has any Unity Catalog.
  2. Grant permissions to the Service Principal to read the system tables:
GRANT SELECT ON system.access.table_lineage TO <monte_carlo_service_principal>;
GRANT SELECT ON system.access.column_lineage TO <monte_carlo_service_principal>;

Hive metastore

Grant permissions to the Service Principal to read the Catalogs - the most common scenario is that only one catalog is present, hive_metastore. Permissions will cascade to all schemas within. In the Hive metastore, the SELECT privilege requires USAGE:

GRANT USAGE, READ_METADATA, SELECT ON CATALOG <CATALOG> TO <monte_carlo_service_principal>;

For details, see Databricks documentation.

3. Create a SQL Warehouse for Metadata Collection

Adding the Metadata Connection allows Monte Carlo to gather metadata on a periodic basis.

If you prefer to use an All-Purpose Cluster, or do not have access to SQL Warehouses in Databricks, please let your Monte Carlo representative know or reach out to Support at [email protected].

  1. Follow these steps to create a SQL Warehouse.
    • For environments with 10,000 tables or fewer Monte Carlo recommends starting with a 2X-Small.
    • ⚠️ If using Glue or the internal Databricks Metastore, the Serverless type is recommended for cost efficiency, but Pro is also supported. For external Hive metastores, Pro is the only supported type - Serverless is not supported. Classic SQL Warehouses are not supported (while they will connect, there are significant performance issues).
    • Most Metadata operations are executed in the driver, so a 2X-Small should be enough, but a larger SQL Warehouse might be required depending on the number and size of Delta tables.
    • ⚠️ Set minimum and maximum number of clusters to 1 - counterintuitively, autoscaling can cause performance problems for metadata collection.
    • Please reach out to your account representative for help right-sizing!
  2. Save the Warehouse ID.
  3. Start the SQL Warehouse.
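The settings above can also be expressed as a payload for the Databricks SQL Warehouses API (`POST /api/2.0/sql/warehouses`). This is a hedged sketch rather than Monte Carlo tooling; the warehouse name and auto-stop value are arbitrary examples, and the other fields mirror this section's recommendations (2X-Small, min and max clusters of 1, Serverless where supported):

```python
import json

def metadata_warehouse_payload(name="monte-carlo-metadata"):
    """Payload for creating the metadata SQL Warehouse, reflecting the
    sizing advice above; field names follow the SQL Warehouses API."""
    return {
        "name": name,
        "cluster_size": "2X-Small",          # starting size for <= 10,000 tables
        "min_num_clusters": 1,               # keep autoscaling off:
        "max_num_clusters": 1,               # min == max == 1
        "enable_serverless_compute": True,   # use Pro instead for external Hive metastores
        "warehouse_type": "PRO",
        "auto_stop_mins": 10,                # example value; tune to your usage
    }

print(json.dumps(metadata_warehouse_payload(), indent=2))
```

The warehouse ID in the API response is the value you save in step 2.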

4. Create a SQL Warehouse for a Query Engine

Adding the SQL Connection allows Monte Carlo to run monitors and root-cause analysis queries.

If you prefer to use an All-Purpose Cluster, or do not have access to SQL Warehouses in Databricks, please let your Monte Carlo representative know or reach out to Support at [email protected].

πŸ“˜

Can I use the same SQL Warehouse for both connections?

While it is possible to use the same SQL Warehouse for both the metadata and query engine connections, we recommend a separate SQL Warehouse for each because of their different requirements.

Metadata collection is composed of many small queries that do not process data, only metadata, so the SQL Warehouse can be small with few workers - a single worker should be enough for most cases.

The SQL Warehouse used as query engine needs to scale with the number and frequency of monitors and the data size.

  1. Follow these steps to create a SQL Warehouse.
    • The size of the SQL Warehouse depends on the number of monitors and data size, Monte Carlo recommends starting with 2X-Small and scaling afterwards as needed.
    • ⚠️ The Serverless type is recommended for cost efficiency reasons, but Pro is also supported. Classic is not supported (while it will connect, there are significant issues with performance).
    • ⚠️ Set minimum and maximum number of clusters to 1 - counterintuitively, autoscaling can cause performance problems. Autoscaling is useful only in a few situations, for example when the number of monitors spikes at specific times of the day.
    • Please reach out to your account representative for help right-sizing!
  2. Save the Warehouse ID.
  3. Start the SQL Warehouse.

5. Verify data access

Confirm that this SQL Warehouse has access to the catalogs, schemas, and tables that need to be monitored.

To check this, you can run the following commands in the Databricks SQL editor. If all of the commands work and show the objects you expect, this SQL Warehouse is configured correctly for the current user. Note, however, that this does not guarantee the job will work with the token configured in Monte Carlo if that token belongs to a different user. If the commands do not show the expected objects, there may be an issue with user permissions or with the settings on the SQL Warehouse. Ensure that the SQL Warehouse is connecting to the correct metastore.

SHOW CATALOGS

SHOW SCHEMAS IN <CATALOG>

SHOW TABLES IN <CATALOG.SCHEMA>

DESCRIBE EXTENDED <CATALOG.SCHEMA.TABLE>

6. Add the Connections in Monte Carlo

This step uses the Monte Carlo UI to add the Connections. Please ensure the SQL Warehouses are turned on in order to add the connections.

  1. To add the Connections, navigate to the Integrations page in Monte Carlo. If this page is not visible to you, please reach out to your account representative.
  2. Under the Data Lake and Warehouses section, click the Create button and select Databricks.
  3. Use the Create Databricks metadata collection and querying connections button.
  4. Under Warehouse Name, enter the name of the connection that you would like to see in Monte Carlo for this Databricks Workspace.
  5. Under Workspace URL, enter the full URL of your Workspace, e.g. https://${instance_id}.cloud.databricks.com. Be sure to include the https://.
  6. Under Workspace ID, enter the Workspace ID of your Databricks Workspace. If there is o= in your Databricks Workspace URL, for example, https://<databricks-instance>/?o=6280049833385130, the number after o= is the Databricks Workspace ID. Here the workspace ID is 6280049833385130. If there is no o= in the deployment URL, the workspace ID is 0.
  7. Under Personal Access or Service Principal Token, enter the Service Principal or Personal Access Token you created in Step 1.
  8. For Metadata Collection Jobs, enter the SQL Warehouse ID (Step 3.2).
  9. Under Query Engine, select the integration type that matches what you set up in Step 4 and enter the SQL Warehouse ID (Step 4.2).
  10. Click Create and validate that the connection was created successfully.
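The Workspace ID rule in step 6 can be sketched as a small helper (illustrative only; the URLs below are placeholders):

```python
from urllib.parse import urlparse, parse_qs

def workspace_id_from_url(url):
    """Return the Databricks Workspace ID per the rule above: the value
    of the o= query parameter if present, otherwise 0."""
    params = parse_qs(urlparse(url).query)
    return int(params["o"][0]) if "o" in params else 0

print(workspace_id_from_url("https://dbc-example.cloud.databricks.com/?o=6280049833385130"))  # 6280049833385130
print(workspace_id_from_url("https://dbc-example.cloud.databricks.com/"))                     # 0
```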

Recommended: Check the Metadata Job
When the metadata connection is added, Monte Carlo will immediately start running the metadata job. Because of the way the job is constructed, we will try to gather metadata about all of the tables in the environment. Oftentimes, the permissions on the SQL Warehouse prevent the metadata job from collecting information from certain tables. It's worthwhile to look at the job logs for the metadata job to see if there are any issues in collection.

Conclusion

You have connected all necessary integration points to get end-to-end observability for Databricks!

Advanced Options

In general, using this Databricks connection type will be sufficient. If your Databricks environment is connecting to an external metastore (Glue or Hive), and you wish to connect Monte Carlo directly to that metastore, we can still gather freshness and volume information on Delta Tables in the Databricks environment. Ask your Monte Carlo representative for more details.