Databricks Delta lake
Are you using the Databricks central Hive metastore?
If so, Delta lake is automatically supported. See here for details on setting up this integration.
These steps are only necessary if you are leveraging Delta with either a Glue catalog or an external Hive metastore. If you haven't yet connected Monte Carlo to either a Glue catalog or an external Hive metastore, please do so before proceeding with the rest of this guide.
Prerequisites
To complete this guide, you will need admin rights in your Databricks workspace.
To connect Monte Carlo to a Databricks Delta lake, follow these steps:
- Create a cluster in your Databricks workspace.
- Create an API key in your Databricks workspace.
- Provide service account credentials to Monte Carlo.
Create a Databricks cluster
Monte Carlo requires a Databricks runtime version with Spark >= 3.0 and at least one worker.
If you are using a Glue catalog: see here for additional cluster requirements.
If you are using an external Hive metastore: see here for additional cluster requirements.
- Create an all-purpose cluster in your workspace. For environments with 10,000 tables or fewer, Monte Carlo recommends using an i3.2xlarge node type. Otherwise, please reach out to your account representative for help right-sizing.
- Follow this guide to retrieve the cluster ID and start the cluster.
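As a rough illustration, the cluster requirements above (running, with at least one worker) can also be checked programmatically. The sketch below assumes the Databricks Clusters REST API (GET /api/2.0/clusters/get) and a bearer token; the helper names and any example values are hypothetical and are not part of Monte Carlo's tooling.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_cluster_get_request(workspace_url: str, cluster_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request for the cluster's current description."""
    query = urlencode({"cluster_id": cluster_id})
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/get?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def cluster_is_ready(cluster_info: dict) -> bool:
    """Return True when the cluster is RUNNING and has at least one worker."""
    workers = cluster_info.get("num_workers", 0)
    # Assumption: autoscaling clusters may report an "autoscale" block
    # (min_workers / max_workers) rather than a fixed "num_workers".
    if "autoscale" in cluster_info:
        workers = cluster_info["autoscale"].get("min_workers", 0)
    return cluster_info.get("state") == "RUNNING" and workers >= 1


def fetch_cluster_info(workspace_url: str, cluster_id: str, token: str) -> dict:
    """Send the request; this needs network access to your workspace."""
    request = build_cluster_get_request(workspace_url, cluster_id, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling cluster_is_ready(fetch_cluster_info(...)) before running the CLI is one way to avoid the timeout that occurs when the cluster is not running.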
Create an API key
- Follow these steps to generate an API key in your workspace with no specified lifetime. Monte Carlo recommends you create a service account associated with the token first.
- Save the generated token.
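If you prefer to script token creation rather than use the UI, the sketch below builds a request for the Databricks Token API (POST /api/2.0/token/create). It is an illustration only: the helper name and comment text are hypothetical, and whether omitting lifetime_seconds actually yields a token with no expiry depends on your workspace's token policies.

```python
import json
import urllib.request
from typing import Optional


def build_token_create_request(
    workspace_url: str,
    auth_token: str,
    comment: str,
    lifetime_seconds: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a token-creation request.

    Leaving lifetime_seconds as None omits the field, which corresponds to
    "no specified lifetime" in the steps above.
    """
    payload = {"comment": comment}
    if lifetime_seconds is not None:
        payload["lifetime_seconds"] = lifetime_seconds
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/token/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The response to this request is expected to carry the new token in a token_value field; that is the value to save for the next step.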
Provide service account credentials
Ensure the cluster is running!
Before moving forward, please ensure your Databricks cluster is running; otherwise the command will time out!
Provide connection details for a Databricks Delta lake using Monte Carlo's CLI:
- Follow this guide to install and configure the CLI. Requires version >= 0.25.1.
- Use the command montecarlo integrations add-databricks-delta to set up Delta connectivity (please ensure the Databricks cluster is running before running this command). For reference, see help for this command below:
This command creates resources in your Databricks workspace.
By default, this command automates the creation of a secret, scope, directory, notebook and job to enable collection in your workspace. If you wish to create these resources manually instead, please reach out to your account representative. Otherwise, only the databricks-workspace-url, databricks-workspace-id, databricks-cluster-id and databricks-token options are required. See this guide for how to locate your workspace ID and URL. The cluster ID and token should have been generated in the previous steps.
$ montecarlo integrations add-databricks-delta --help
Usage: montecarlo integrations add-databricks-delta [OPTIONS]

  Setup a Databricks Delta integration. For metadata queries on delta tables
  when using an external metastore in databricks.

Options:
  --databricks-workspace-url TEXT
                                  Databricks workspace URL. [required]
  --databricks-workspace-id TEXT  Databricks workspace ID. [required]
  --databricks-cluster-id TEXT    Databricks cluster ID. [required]
  --databricks-token TEXT         Databricks access token. If you prefer a
                                  prompt (with hidden input) enter -1.
                                  [required]
  --skip-secret-creation          Skip secret creation.
  --databricks-secret-key TEXT    Databricks secret key. [default: monte-
                                  carlo-collector-gateway-secret]
  --databricks-secret-scope TEXT  Databricks secret scope. [default: monte-
                                  carlo-collector-gateway-scope]
  --skip-notebook-creation        Skip notebook creation. This option requires
                                  setting 'databricks-job-id', 'databricks-
                                  job-name', and 'databricks_notebook_path'.
  --databricks-job-id TEXT        Databricks job id, required if notebook
                                  creation is skipped. This option requires
                                  setting 'skip-notebook-creation'.
  --databricks-job-name TEXT      Databricks job name, required if notebook
                                  creation is skipped. This option requires
                                  setting 'skip-notebook-creation'.
  --databricks-notebook-path TEXT
                                  Databricks notebook path, required if
                                  notebook creation is skipped. This option
                                  requires setting 'skip-notebook-creation'.
  --databricks-notebook-source TEXT
                                  Databricks notebook source, required if
                                  notebook creation is skipped. (e.g. "resourc
                                  es/databricks/notebook/v1/collection.py")
                                  This option requires setting 'skip-notebook-
                                  creation'.
  --collector-id UUID             ID for the data collector. To disambiguate
                                  accounts with multiple collectors.
  --skip-validation               Skip all connection tests. This option
                                  cannot be used with 'validate-only'.
  --validate-only                 Run connection tests without adding. This
                                  option cannot be used with 'skip-
                                  validation'.
  --auto-yes                      Skip any interactive approval.
  --name TEXT                     Friendly name of the warehouse which the
                                  connection will belong to.
  --option-file FILE              Read configuration from FILE.
  --help                          Show this message and exit.
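To make the required options concrete, here is a minimal sketch that assembles the command line from the flags listed in the help output above. The helper name and the example values are hypothetical; the token is passed as -1 so that, per the help text, the CLI prompts for it with hidden input instead of exposing it in shell history.

```python
import shlex
from typing import List, Optional


def build_add_delta_command(
    workspace_url: str,
    workspace_id: str,
    cluster_id: str,
    validate_only: bool = False,
    name: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `montecarlo integrations add-databricks-delta`."""
    command = [
        "montecarlo", "integrations", "add-databricks-delta",
        "--databricks-workspace-url", workspace_url,
        "--databricks-workspace-id", workspace_id,
        "--databricks-cluster-id", cluster_id,
        # -1 asks the CLI to prompt for the token (hidden input).
        "--databricks-token", "-1",
    ]
    if validate_only:
        # Runs the connection tests without adding the integration.
        command.append("--validate-only")
    if name:
        command += ["--name", name]
    return command


def as_shell_string(command: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in command)
```

The resulting list can be handed to subprocess.run(command, check=True) from an interactive terminal, or printed with as_shell_string and pasted into a shell. Running once with validate_only=True first is a low-risk way to check connectivity before any resources are created.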
FAQs
What Databricks platforms are supported?
All three Databricks platforms (AWS, GCP and Azure) are supported!
What if I am already using the Unity Catalog (UC) public preview? Is that supported too?
It is! See details here. Note that Freshness SLIs are not supported.
What about my non-Delta tables?
As with Delta tables, you can opt into field health, dimension, custom SQL and SLI monitors. To enable write throughput and freshness, please enable S3 metadata events. See here for details on how to set up this integration.
How many Databricks workspaces are supported?
We can support multiple workspaces by setting up additional integrations. If you are using the Unity Catalog, you only need to set up a single connection to that catalog, even if there are multiple workspaces connected to it.
Do I need to set up a query engine connection too?
Yes, if you want to granularly track data health on particular tables (i.e., by creating opt-in monitors). See here for details. Note that Spark is not the only query engine supported; you can leverage others, like Athena with Glue.
What is the minimum Data Collector version required?
The Databricks integration requires at least v3127 of the Data Collector. You can use the Monte Carlo CLI to verify the current version of your Data Collector and upgrade if necessary.
$ montecarlo collectors list
If the Data Collector you are using is out of date, you will see a "Databricks operations require DC v3127 or above" error during onboarding.
How do I handle a "Cannot make databricks job request for a DC with disabled remote updates" error?
If you have disabled remote updates on your Data Collector we cannot automatically provision resources in your Databricks workspace using the CLI. Please reach out to your account representative for details on how to create these resources manually.
How do I handle an "A Databricks connection already exists" error?
This means you have already connected to Databricks. You cannot have more than one Databricks metastore or Databricks Delta integration.
How do I handle an "Exactly 1 Glue or Hive metastore is required" error?
This means either that you have not yet connected to Glue/Hive, which is required for enabling Delta support when not using the central Databricks metastore, or that you have more than one connection, which is not currently supported.
How do I handle a "Scope monte-carlo-collector-gateway-scope already exists" error?
This means a scope with this name already exists in your workspace. You can have the command create a scope with a different name by passing the --databricks-secret-scope flag.
Alternatively, after carefully reviewing usage, you can delete the scope via the Databricks CLI/API. Please ensure you are not using this scope elsewhere as any secrets attached to the scope are not recoverable. See details here.
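Before re-running the command, it can help to see which scope names are already taken. The sketch below works on the JSON returned by the Databricks Secrets API (GET /api/2.0/secrets/scopes/list), which is assumed to have the shape {"scopes": [{"name": ...}]}. The helper names and the numeric-suffix naming scheme are hypothetical; any unused name can be passed to --databricks-secret-scope.

```python
# Default scope name taken from the CLI help output.
DEFAULT_SCOPE = "monte-carlo-collector-gateway-scope"


def scope_exists(scopes_list_response: dict, scope_name: str = DEFAULT_SCOPE) -> bool:
    """Check whether a scope name appears in a scopes/list response."""
    scopes = scopes_list_response.get("scopes", [])
    return any(scope.get("name") == scope_name for scope in scopes)


def pick_scope_name(scopes_list_response: dict, preferred: str = DEFAULT_SCOPE) -> str:
    """Return `preferred` if it is free, otherwise the first free suffixed variant."""
    if not scope_exists(scopes_list_response, preferred):
        return preferred
    suffix = 2
    while scope_exists(scopes_list_response, f"{preferred}-{suffix}"):
        suffix += 1
    return f"{preferred}-{suffix}"
```

This only picks a name that avoids the collision; it does not delete anything, so secrets attached to the existing scope are left untouched.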
How do I handle a "Path (/monte_carlo/collector/integrations/collection.py) already exists" error?
This means a notebook with this name already exists in your workspace. If you can confirm this is a notebook provisioned by Monte Carlo and there are no existing jobs you should be able to delete the notebook via the Databricks CLI/API. See details here. Otherwise please reach out to your account representative.
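Part of that check (whether any existing job still points at the notebook) can be sketched in code. The example below assumes you have already fetched two JSON responses from the Databricks REST API: the workspace object status (GET /api/2.0/workspace/get-status) and the jobs list (GET /api/2.1/jobs/list, requested with task details expanded). The helper names are hypothetical, and the code cannot confirm that the notebook was provisioned by Monte Carlo; that still needs a manual review.

```python
# Path taken from the error message above.
NOTEBOOK_PATH = "/monte_carlo/collector/integrations/collection.py"


def jobs_referencing_notebook(jobs_list_response: dict, notebook_path: str = NOTEBOOK_PATH) -> list:
    """Return the IDs of jobs with a notebook task pointing at notebook_path."""
    matches = []
    for job in jobs_list_response.get("jobs", []):
        settings = job.get("settings", {})
        # Assumption: Jobs API 2.1 nests tasks under settings["tasks"], while
        # 2.0 puts a single notebook_task directly on the settings object.
        tasks = settings.get("tasks") or [settings]
        if any(task.get("notebook_task", {}).get("notebook_path") == notebook_path for task in tasks):
            matches.append(job["job_id"])
    return matches


def looks_unreferenced(status_response: dict, jobs_list_response: dict) -> bool:
    """True if the object is a notebook and no listed job references its path."""
    if status_response.get("object_type") != "NOTEBOOK":
        return False
    path = status_response.get("path", "")
    return not jobs_referencing_notebook(jobs_list_response, path)
```

A True result is only a signal that deletion is probably safe. If any job is returned by jobs_referencing_notebook, reach out to your account representative instead of deleting the notebook.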
How do I retrieve job logs?
- Open your Databricks workspace.
- Select Workflows from the sidebar menu.
- Select Jobs from the top and search for a job containing the name monte-carlo-metadata-collection.
- Select the job.
- Select any run to review logs for that particular execution. The jobs should all show Succeeded for the status, but for partial failures (e.g. S3 permission issues) the log output will contain the errors and overall error counts.
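The same lookup can be approximated outside the UI. The sketch below processes JSON responses from the Databricks Jobs API (GET /api/2.1/jobs/list and GET /api/2.1/jobs/runs/list), assuming their usual shapes of {"jobs": [...]} and {"runs": [...]}. The helper names are hypothetical. Note that the API reports a successful run as SUCCESS, and, as described above, a successful status does not rule out partial failures, so the run logs still need to be read.

```python
from collections import Counter

# Name fragment taken from the steps above.
JOB_NAME_FRAGMENT = "monte-carlo-metadata-collection"


def find_collection_jobs(jobs_list_response: dict, fragment: str = JOB_NAME_FRAGMENT) -> list:
    """Return (job_id, name) pairs for jobs whose name contains the fragment."""
    found = []
    for job in jobs_list_response.get("jobs", []):
        name = job.get("settings", {}).get("name", "")
        if fragment in name:
            found.append((job["job_id"], name))
    return found


def summarize_runs(runs_list_response: dict) -> dict:
    """Count runs by result state, falling back to life cycle state for unfinished runs."""
    counts = Counter()
    for run in runs_list_response.get("runs", []):
        state = run.get("state", {})
        label = state.get("result_state") or state.get("life_cycle_state", "UNKNOWN")
        counts[label] += 1
    return dict(counts)
```

Any result state other than SUCCESS in the summary points to a run worth opening first; runs that did succeed may still contain error counts in their log output.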