Data Lakes

Integrate Monte Carlo with your Data Lake

📘 Your data collector must be deployed before connecting a data lake. See instructions here.

Integrating Monte Carlo with a data lake allows Monte Carlo to track the health of your tables and pipelines. By automatically pulling metadata, query logs, and metrics from the lake, Monte Carlo can provide end-to-end data observability.

To integrate a data lake, you will:

  1. Enable network connectivity between the data lake's components and Monte Carlo's data collector if they are not publicly accessible.
  2. Grant Monte Carlo access to your data lake components by creating read-only service accounts and/or IAM roles. See detailed instructions in dedicated guides.
  3. Provide credentials and roles to Monte Carlo through the onboarding wizard and CLI commands to validate and complete the integration.
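The read-only access described in step 2 is typically granted through an IAM policy attached to the role or service account that Monte Carlo uses. As a rough sketch only — the bucket name is a placeholder and the exact set of actions required is listed in the dedicated integration guides — a policy covering a Glue catalog and an S3 bucket might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadGlueCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ReadLakeBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-lake-bucket",
        "arn:aws:s3:::example-lake-bucket/*"
      ]
    }
  ]
}
```

Keeping the policy read-only is the key design point: Monte Carlo observes the lake but never needs write access to it.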

Integration points

A data lake setup typically consists of the following integration points, with at least one integration configured in each category.

| Integration category | Purpose | Guides |
| --- | --- | --- |
| Metadata | Allows Monte Carlo to track the lake's tables, their schemas, and other metadata | Hive (metastore), Glue |
| Events (S3) | Allows Monte Carlo to track data freshness and volume at scale for tables stored on S3 | S3 events |
| Query logs | Allows Monte Carlo to track lineage, usage analytics, and query history; can use multiple sources, depending on the query engines used on the data lake | EMR/Presto logs on S3, Athena |
| Query engine | Allows Monte Carlo to run queries that granularly track data health on particular tables | Presto, Athena, Hive (SQL), Spark |
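The Events (S3) category generally relies on S3 bucket notifications so that object-level changes reach Monte Carlo without repeated scans. A minimal sketch of a bucket notification configuration — the queue ARN, account ID, and configuration ID are placeholders, and the actual notification target and event types to use are covered in the S3 events guide:

```json
{
  "QueueConfigurations": [
    {
      "Id": "monte-carlo-object-events",
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:example-events-queue",
      "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }
  ]
}
```

This is the document shape accepted by `aws s3api put-bucket-notification-configuration`; routing created/removed events to a queue is what lets freshness and volume be tracked at scale.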

