Your data collector must be deployed before connecting a data lake; see the data collector deployment instructions.
Integrating Monte Carlo with a data lake allows Monte Carlo to track the health of tables and pipelines. By automatically pulling metadata, query logs, and metrics from the lake, Monte Carlo can provide end-to-end data observability.
To integrate a data lake, you will:
- Enable network connectivity between the data lake's components and Monte Carlo's data collector if they are not publicly accessible.
- Grant Monte Carlo access to your data lake components by creating read-only service accounts and/or IAM roles. See detailed instructions in dedicated guides.
- Provide credentials and roles to Monte Carlo through the CLI commands to validate and complete the integration.
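The read-only grant in the second step typically takes the form of an IAM policy attached to a role the data collector can assume. The sketch below builds a minimal read-only S3 policy document; the bucket name is a placeholder, and the exact set of actions Monte Carlo requires is listed in the dedicated guides.

```python
import json

def read_only_s3_policy(bucket: str) -> str:
    """Build a minimal read-only IAM policy document for one S3 bucket.

    Illustrative only: the bucket name is a placeholder, and the actual
    permissions Monte Carlo requires are listed in the integration guides.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)

print(read_only_s3_policy("my-data-lake-bucket"))
```

Keeping the grant read-only means the collector can pull metadata and logs without being able to modify anything in the lake.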
To fully stand up Monte Carlo with your data lake, you'll need to integrate your storage system (S3), a metastore, and a query engine. Monte Carlo will not work as intended unless you connect all three of these components.
If you’re using Databricks’ internal metastore, Monte Carlo supports all three major cloud providers (AWS, GCP, Azure), so you are covered regardless of which storage system you use.
To set up your data lake with Monte Carlo, follow the links below for each data lake component.
| Component | What it enables | Integration options |
|---|---|---|
| Metadata | Allows Monte Carlo to track the lake's tables, their schemas, and other metadata | Hive (metastore) |
| Events (S3) | Allows Monte Carlo to track data freshness and volume at scale for tables stored on S3 | S3 events |
| Query logs | Allows Monte Carlo to track lineage, usage analytics, and query history; can use multiple sources, depending on the query engines used on the data lake | S3 events (query logs) |
| Query engine | Allows Monte Carlo to run queries that granularly track data health on particular tables | Presto, Hive (SQL), Spark |
| Providers | Allows Monte Carlo to better monitor metadata from more advanced storage layers | Delta Lake |
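For the Events (S3) row, Monte Carlo consumes bucket notifications for object activity. The sketch below builds the kind of notification configuration you would attach to a lake bucket; the SQS queue ARN and prefix are placeholders, since the real destination comes from your data collector deployment.

```python
import json

def s3_notification_config(queue_arn: str, prefix: str = "") -> dict:
    """Build an S3 bucket notification configuration that routes
    object-created and object-removed events to an SQS queue.

    The queue ARN is a placeholder for the collector's queue; the
    optional prefix limits events to one area of the bucket.
    """
    config = {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }
    if prefix:
        config["QueueConfigurations"][0]["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return config

config = s3_notification_config(
    "arn:aws:sqs:us-east-1:123456789012:collector-queue", prefix="tables/"
)
print(json.dumps(config, indent=2))
```

Object-created and object-removed events are what let Monte Carlo infer freshness and volume changes without querying every table.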
Can Monte Carlo monitor just S3 / Athena / Hive / etc.?
Monte Carlo requires at least a storage system, a metastore, and a query engine to work as intended. This is because each of these systems provides a different slice of the metadata MC leverages for data observability purposes.
What does MC need to automatically generate lineage?
MC requires query logs for ETL procedures (DDL, DML). This usually means EMR or Presto logs accessible in S3, or a connection to Athena if you run ETL through Athena. MC currently does not support lineage in Spark, though a Databricks Unity-based lineage capability is coming soon for Databricks users.
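To see why query logs are what enables lineage, consider a toy parser that pulls source and destination tables out of an `INSERT ... SELECT` statement. Monte Carlo's actual log processing is far more sophisticated; the regex below is purely illustrative.

```python
import re

def toy_lineage(sql: str):
    """Extract (destination, sources) from an INSERT ... SELECT statement.

    Purely illustrative: real query-log parsing must handle the full SQL
    grammar, CTEs, subqueries, and engine-specific dialects.
    """
    dest_match = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    destination = dest_match.group(1) if dest_match else None
    return destination, sources

dest, srcs = toy_lineage(
    "INSERT INTO analytics.daily_orders "
    "SELECT * FROM raw.orders JOIN raw.customers ON orders.cid = customers.id"
)
# dest → "analytics.daily_orders"; srcs → ["raw.orders", "raw.customers"]
```

Each parsed statement yields edges from source tables to a destination table, which is exactly the raw material for a lineage graph.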
What if we use multiple query engines?
MC currently supports only one query engine type per environment. This means that if you use multiple query engines (e.g., Spark and Athena), or multiple instances of one engine (e.g., several Spark clusters), we ask that you select the one that best suits your use case.
For advanced users of custom monitors, we support multiple query connections of the same type via the API/SDK. For example, you can now set up multiple Spark warehouse connections with varying configurations to optimize for performance and cost. See the API docs for details, usage examples, and any limitations of this feature. Note that MC currently does not support mixing engines from different connection types. Reach out to your Monte Carlo representative, or click on the chat bot in the lower right-hand corner, if you have additional questions.
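As an illustration of why multiple same-type connections are useful, the sketch below routes monitors to one of two hypothetical Spark warehouse connections based on workload. The connection names and settings are invented for this example; the actual connections are created through the API/SDK as described in the API docs.

```python
# Hypothetical registry of two Spark connections created via the API/SDK:
# a small, cheap cluster for routine monitors and a large one for heavy scans.
CONNECTIONS = {
    "spark-small": {"cluster_size": "S", "cost_per_hour": 1.0},
    "spark-large": {"cluster_size": "L", "cost_per_hour": 6.0},
}

def pick_connection(monitor: dict) -> str:
    """Route a monitor to a connection: full-table scans over very large
    tables go to the large cluster, everything else to the cheap one.

    Connection names and the 10M-row threshold are invented for illustration.
    """
    heavy = (
        monitor.get("scan_type") == "full_table"
        and monitor.get("table_rows", 0) > 10_000_000
    )
    return "spark-large" if heavy else "spark-small"

print(pick_connection({"scan_type": "full_table", "table_rows": 50_000_000}))
# → spark-large
print(pick_connection({"scan_type": "incremental", "table_rows": 50_000_000}))
# → spark-small
```

Splitting workloads this way keeps routine monitoring cheap while still giving expensive scans the capacity they need.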