Overview: Data Lakes

Integrate Monte Carlo with your Data Lake

📘 Your data collector must be deployed before connecting a data lake. See instructions here.

Integrating Monte Carlo with a data lake allows Monte Carlo to track the health of tables and pipelines. By automatically pulling metadata, query logs, and metrics from the lake, Monte Carlo can provide end-to-end data observability.

To integrate a data lake, you will:

  1. Enable network connectivity between the data lake's components and Monte Carlo's data collector if they are not publicly accessible.
  2. Grant Monte Carlo access to your data lake components by creating read-only service accounts and/or IAM roles. See detailed instructions in dedicated guides.
  3. Provide credentials and roles to Monte Carlo through CLI commands to validate and complete the integration (see the sketch below).
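
For example, once your service accounts or IAM roles are in place, you can confirm that your Monte Carlo API credentials work before running the integration commands from the guides. Below is a minimal sketch using pycarlo, Monte Carlo's Python SDK (the same API the CLI wraps); the placeholder key values are assumptions for illustration.

```python
# Minimal connectivity check with pycarlo, Monte Carlo's Python SDK.
# Assumes you have already created an API key in the Monte Carlo UI;
# the placeholder values below are illustrative.
from pycarlo.core import Client, Query, Session

# Credentials can also come from a `montecarlo configure` profile;
# passing them explicitly keeps this sketch self-contained.
client = Client(session=Session(
    mcd_id="YOUR_API_KEY_ID",
    mcd_token="YOUR_API_KEY_SECRET",
))

# Fetch the calling user's email -- a cheap way to confirm the
# credentials are valid before wiring up the lake components.
query = Query()
query.get_user.__fields__("email")
print(client(query).get_user.email)
```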

Integration points

To fully stand up Monte Carlo with your data lake, you'll need to integrate your storage system (S3), a metastore, and a query engine. MC will not work as intended unless you connect all of these components.

If you’re using Databricks’ internal metastore, MC supports all three major cloud providers (AWS, GCP, Azure). In other words, MC can support you regardless of what storage system you use.

To set up your data lake with Monte Carlo, follow the links below for each data lake component.

| Integration category | Purpose | Guides |
| --- | --- | --- |
| Metadata | Allows Monte Carlo to track the lake's tables, their schemas, and other metadata | Hive (metastore), Glue, Databricks (metastore), Unity Catalog |
| Events (S3) | Allows Monte Carlo to track data freshness and volume at scale for tables stored on S3 | S3 events |
| Query logs | Allows Monte Carlo to track lineage, usage analytics, and query history; can use multiple sources, depending on the query engines used on the data lake | S3 Events - Query Logs, Athena |
| Query engine | Allows Monte Carlo to run queries that granularly track data health on particular tables | Presto, Athena, Hive (SQL), Spark |
| Providers | Allows Monte Carlo to better monitor metadata from more advanced storage layers | Delta Lake |

FAQs

Can Monte Carlo monitor just S3 / Athena / Hive / etc.?
Monte Carlo requires at least a storage system, a metastore, and a query engine to work as intended, because each of these systems provides a different slice of the metadata MC leverages for data observability.

What does MC need to automatically generate lineage?
MC requires query logs covering your ETL procedures (DDL and DML statements). This usually means EMR or Presto logs accessible in S3, or a connection to Athena if you run ETL through Athena. MC currently does not support lineage in Spark, though a Unity Catalog-based lineage capability is coming soon for Databricks users.

What if we use multiple query engines?
MC currently supports only one query engine type per environment. This means that if you use multiple query engines (e.g., Spark and Athena) or multiple instances of the same engine (e.g., several Spark clusters), we ask that you select the one that best suits your use case.

For advanced users of custom monitors, we support multiple query connections of the same type via the API/SDK. For example, you can set up multiple Spark warehouse connections with varying configurations to optimize for performance and cost; a sketch follows below. See the API docs here for details, usage examples, and any limitations on this feature. Note that MC currently does not support mixing engines from different connection types. Reach out to your Monte Carlo representative, or click the chat bot in the lower right-hand corner, if you have additional questions.
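
For illustration, here is a hedged sketch of enumerating your existing connections through pycarlo before adding another of the same type. The `get_user.account.connections` selection and its fields are assumptions about the schema's shape, not the documented API; check the API docs linked above for the real query and mutation names.

```python
# Sketch: enumerate existing query connections via pycarlo before
# adding another connection of the same type through the API/SDK.
# NOTE: the account/connections selection below is an assumed shape;
# verify field names against Monte Carlo's API docs.
from pycarlo.core import Client, Query, Session

client = Client(session=Session(
    mcd_id="YOUR_API_KEY_ID",
    mcd_token="YOUR_API_KEY_SECRET",
))

query = Query()
# Hypothetical selection: list each connection's id and type so you
# can see which Spark (or other) connections already exist.
query.get_user.account.connections.__fields__("uuid", "type")

for connection in client(query).get_user.account.connections:
    print(connection.uuid, connection.type)
```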