To complete this guide, you will need admin rights in your Databricks workspace.
To connect Monte Carlo to the Databricks Unity Catalog (UC), follow these steps:
- Create a cluster in your Databricks workspace.
- Create an API key in your Databricks workspace.
- Provide service account credentials to Monte Carlo.
Monte Carlo requires a Databricks runtime version with Spark >= 3.0 and at least one worker.
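As a quick sanity check before provisioning, the requirement above can be expressed as a small predicate. This is only an illustrative sketch: `meets_cluster_requirements` is a hypothetical helper, not part of Monte Carlo or Databricks, and it assumes the Spark version is given as a `major.minor[.patch]` string.

```python
def meets_cluster_requirements(spark_version: str, num_workers: int) -> bool:
    """Return True if Spark >= 3.0 and the cluster has at least one worker.

    Hypothetical helper for illustration; assumes a "major.minor[.patch]" string.
    """
    # Compare only the major and minor components, e.g. "3.3.0" -> (3, 3).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 0) and num_workers >= 1


print(meets_cluster_requirements("3.3.0", 2))  # True
print(meets_cluster_requirements("2.4.5", 2))  # False: Spark is older than 3.0
print(meets_cluster_requirements("3.1.2", 0))  # False: no workers
```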
Are you also using an external metastore?
If you are using the built-in (i.e. central) Databricks Hive metastore, it is automatically supported when you provision a UC cluster. If you want to use UC with either the Glue Catalog or an external Hive metastore instead, please follow the guides below for additional cluster requirements:
Follow these steps to create a UC-compatible all-purpose cluster in your workspace. For environments with 10,000 tables or fewer, Monte Carlo recommends using an i3.2xlarge node type. Otherwise, please reach out to your account representative for help right-sizing.
Follow this guide to retrieve the cluster ID and start the cluster.
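If you prefer to script cluster creation instead of using the UI, the request bodies for the Databricks Clusters REST API (`POST /api/2.0/clusters/create` and `POST /api/2.0/clusters/start`) might look roughly like the sketch below. Several things here are assumptions rather than Monte Carlo requirements: the helper names are hypothetical, the runtime string `11.3.x-scala2.12` is only a placeholder, and `SINGLE_USER` is one of the Unity Catalog capable access modes chosen purely for illustration. The sketch builds the payloads only and sends nothing.

```python
import json


def build_cluster_spec(name: str, spark_version: str, num_workers: int = 1,
                       data_security_mode: str = "SINGLE_USER") -> dict:
    """Build a request body for POST /api/2.0/clusters/create (hypothetical helper)."""
    if num_workers < 1:
        raise ValueError("Monte Carlo requires at least one worker")
    return {
        "cluster_name": name,
        "spark_version": spark_version,            # placeholder runtime key, pick your own
        "node_type_id": "i3.2xlarge",              # recommended for <= 10,000 tables
        "num_workers": num_workers,
        "data_security_mode": data_security_mode,  # assumption: a UC-capable access mode
    }


def build_start_request(cluster_id: str) -> dict:
    """Build a request body for POST /api/2.0/clusters/start."""
    return {"cluster_id": cluster_id}


spec = build_cluster_spec("monte-carlo-uc", "11.3.x-scala2.12")
print(json.dumps(spec, indent=2))
```

The create call returns a `cluster_id` in its response, which is the value to pass to `build_start_request` and, later, to Monte Carlo.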
- Follow these steps to generate an API key in your workspace with no specified lifetime. Monte Carlo recommends you create a service account associated with the token first.
- Save the generated token.
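For readers who would rather generate the token programmatically, a minimal sketch using the Databricks Token API (`POST /api/2.0/token/create`) follows. Omitting `lifetime_seconds` from the body requests a token with no specified lifetime. The function name, workspace URL, and credential strings are hypothetical placeholders, and the sketch does not cover associating the token with a service account. It only constructs the request and does not send it.

```python
import json
import urllib.request


def build_token_request(workspace_url: str, auth_token: str,
                        comment: str) -> urllib.request.Request:
    """Build (but do not send) a token-creation request. Hypothetical helper."""
    # No "lifetime_seconds" key: the token is created without an expiry.
    body = {"comment": comment}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/token/create",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with your workspace URL and an existing credential.
req = build_token_request("https://example.cloud.databricks.com",
                          "EXISTING_TOKEN", "monte-carlo")
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) returns a JSON body whose `token_value` field holds the generated token to save.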
Provide connection details for the Databricks Unity Catalog (UC), using Monte Carlo's CLI:
- Follow this guide if you are only using UC or using UC with the built-in (i.e. central) Databricks Hive metastore.
- Follow this guide if you are using UC with either a Glue catalog or an external Hive metastore.
What Databricks platforms are supported?
All three Databricks platforms (AWS, GCP and Azure) are supported!
What about Delta Lake?
This integration supports Delta tables too! Delta size and freshness metrics are monitored out of the box. You can also opt in to any field health, dimension, custom SQL and SLI monitors. See here for additional details.
What about my non Delta tables?
As with Delta tables, you can opt in to field health, dimension, custom SQL and SLI monitors. To enable write throughput and freshness, please enable S3 metadata events. See here for details on how to set up this integration.
How many Databricks workspaces are supported?
We can support multiple workspaces by setting up additional integrations. If you are using Unity Catalog, you only need to set up a single connection to that catalog, even if there are multiple workspaces connected to it.
Are there any limitations?
Freshness SLIs with Glue or an external Hive metastore are not supported.
Do I need to set up a query engine connection too?
If you are only using UC, or UC with the built-in (i.e. central) Databricks Hive metastore, the cluster you created above can also be used for granularly tracking data health on particular tables (i.e. creating opt-in monitors).
Otherwise, yes, if you want to granularly track data health on particular tables (i.e. create opt-in monitors). See here for details. Note that Spark is not the only query engine supported; you can leverage others, like Athena with Glue.