Data Lakes
Integrate Monte Carlo with your Data Lake
Your data collector must be deployed before connecting a data lake.
See instructions here.
Integrating Monte Carlo with a data lake allows Monte Carlo to track the health of tables and pipelines. By automatically pulling metadata, query logs, and metrics from the lake, Monte Carlo can provide end-to-end data observability.
To integrate a data lake, you will:
- Enable network connectivity between the data lake's components and Monte Carlo's data collector if they are not publicly accessible.
- Grant Monte Carlo access to your data lake components by creating read-only service accounts and/or IAM roles (see the sketch after this list); detailed instructions are in the dedicated guides.
- Provide credentials and roles to Monte Carlo through the onboarding wizard and CLI commands to validate and complete the integration.
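For the access step above, a minimal sketch of a read-only IAM role that the Monte Carlo data collector could assume is shown below, using boto3. The account ID, external ID, role name, and permissions are placeholders; the dedicated guides list the exact policies each integration requires.

```python
import json

import boto3

iam = boto3.client("iam")

# Placeholder trust policy: lets the data collector's AWS account assume this
# role, scoped with an external ID. Both identifiers are hypothetical.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:root"},  # collector account (placeholder)
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
    }],
}

iam.create_role(
    RoleName="monte-carlo-read-only",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Read-only role assumed by the Monte Carlo data collector",
)

# Placeholder read-only permissions; the actions you actually need depend on
# which metadata, events, query log, and query engine integrations you configure.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:Get*", "glue:List*", "s3:GetObject", "s3:ListBucket"],
        "Resource": "*",
    }],
}

iam.put_role_policy(
    RoleName="monte-carlo-read-only",
    PolicyName="monte-carlo-read-only-access",
    PolicyDocument=json.dumps(read_only_policy),
)
```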
Integration points
A data lake setup typically includes the following integration categories, with at least one integration configured for each.
Integration category | Purpose | Guides |
---|---|---|
Metadata | Allows Monte Carlo to track the lake's tables, their schemas, and other metadata | |
Events (S3) | Allows Monte Carlo to track data freshness and volume at scale for tables stored on S3 (see the sketch below this table) | |
Query logs | Allows Monte Carlo to track lineage, usage analytics, and query history; can use multiple sources, depending on the query engines used on the data lake | |
Query engine | Allows Monte Carlo to run queries that granularly track data health on particular tables | |
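For the Events (S3) category, tracking freshness and volume at scale typically relies on S3 event notifications being delivered to a target the data collector can read. As a rough, hypothetical sketch with boto3 (the bucket name, queue ARN, and event types are placeholders; the Events (S3) guide describes the actual notification target and setup):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and queue; the Events (S3) guide defines the real
# notification target consumed by the data collector.
s3.put_bucket_notification_configuration(
    Bucket="my-data-lake-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:monte-carlo-events",
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }]
    },
)
```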