Your data collector must be deployed before connecting a data lake.
See instructions here.
Integrating Monte Carlo with a data lake allows Monte Carlo to track the health of your tables and pipelines. By automatically pulling metadata, query logs, and metrics from the lake, Monte Carlo can provide end-to-end data observability.
To integrate a data lake, you will:
- Enable network connectivity between the data lake's components and Monte Carlo's data collector if they are not publicly accessible.
- Grant Monte Carlo access to your data lake components by creating read-only service accounts and/or IAM roles. See detailed instructions in dedicated guides.
- Provide credentials and roles to Monte Carlo through the onboarding wizard and CLI commands to validate and complete the integration.
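Before working through the first step, it can help to confirm which lake components are reachable over the network. The following is a minimal sketch, not part of Monte Carlo's tooling: the host names are hypothetical, the ports are assumed defaults that may differ in your environment, and the check is only meaningful when run from the network where the data collector is deployed.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical endpoints for illustration only; replace with your own
# data lake components and the ports they actually listen on.
ENDPOINTS = {
    "metastore": ("metastore.internal.example.com", 9083),
    "query engine": ("query-engine.internal.example.com", 8080),
}


def check_endpoints(endpoints: dict) -> dict:
    """Map each endpoint name to whether a TCP connection could be opened."""
    return {
        name: is_reachable(host, port)
        for name, (host, port) in endpoints.items()
    }


if __name__ == "__main__":
    for name, ok in check_endpoints(ENDPOINTS).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful TCP connection only shows that the port is open. It does not validate credentials or permissions, which are covered by the remaining steps.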
A data lake setup typically consists of the following categories of integrations, with at least one integration configured for each category:
- Allows Monte Carlo to track the lake's tables, their schemas, and other metadata.
- Allows Monte Carlo to track data freshness and volume at scale for tables stored on S3.
- Allows Monte Carlo to track lineage, usage analytics, and query history. Multiple sources can be used, depending on the query engines used on the data lake.
- Allows Monte Carlo to run queries that granularly track data health on particular tables.