# File Storage Monitoring

## Overview
Modern data architectures rely heavily on object storage—such as Amazon S3, Azure Blob Storage, or Google Cloud Storage—to store raw, semi-structured, or historical data. These file-based data lakes serve as the foundation for analytics, machine learning, and operational reporting, providing a cost-effective and flexible storage layer for massive volumes of data. While object storage is ideal for durability and scale, it can be challenging to monitor: file-level changes, schema evolution, and data quality issues are often hidden until they impact downstream analytics.
To provide visibility into file-based data lakes, Monte Carlo can monitor external tables that represent files in object storage. By mapping raw files into structured external tables, Monte Carlo can track schema changes, volume, freshness, and quality metrics—giving teams actionable observability over their file storage.
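To make the mapping concrete, here is a minimal sketch of what declaring an external table over files in S3 can look like, using Snowflake syntax as one example. The stage, storage integration, bucket path, and column names are placeholders, and the exact DDL for your warehouse will differ (see the guides linked at the bottom of this page).

```sql
-- Illustrative sketch only: map Parquet files in S3 to a Snowflake external table.
-- The stage, storage integration, bucket path, and column names are placeholders.
CREATE STAGE raw_events_stage
  URL = 's3://example-bucket/events/'
  STORAGE_INTEGRATION = example_s3_integration;

CREATE EXTERNAL TABLE analytics.raw.events_ext (
  event_id   VARCHAR   AS (value:event_id::VARCHAR),
  event_time TIMESTAMP AS (value:event_time::TIMESTAMP)
)
LOCATION = @raw_events_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE; -- refresh file metadata as new objects arrive (requires event notifications)
```

Once an external table like this exists in the warehouse Monte Carlo is connected to, it can be monitored like any other table.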
When you’re ready to get started, refer to the examples and guides linked at the bottom of this page. You can also access them from the sidebar.
**Does Monte Carlo connect directly to my object storage?** Not directly. Monte Carlo monitors external tables in your cloud data warehouse. Those external tables are the structured “mapping” that points to files in object storage, so Monte Carlo can observe the data through the warehouse connection.

**What if I don’t use external tables or don’t have a cloud data warehouse yet?** That’s okay! This guide includes examples you can reference to set up external tables in supported warehouses. If you’d like, you can also create a warehouse solely for this use case and scale it appropriately.
## Feature Support
| Category | Monitor / Lineage Capabilities | Support |
|---|---|---|
| Table Monitor | Freshness (via opt-in volume monitor) | ✅ |
| Table Monitor | Volume (opt-in) | ✅ |
| Table Monitor | Schema Changes | ✅ |
| Table Monitor | JSON Schema Changes | ✅* |
| Metric Monitor | Metric | ✅ |
| Metric Monitor | Comparison | ✅ |
| Validation Monitor | Custom SQL | ✅ |
| Validation Monitor | Validation | ✅ |
| Job Monitor | Query performance | ❌ |
| Lineage | Lineage | ❌ |
*JSON Schema monitors are only supported in our AWS Redshift, Snowflake, and GCP BigQuery integrations.
## File Storage and External Table Provider Support
Depending on which cloud file storage provider you use, there are multiple cloud warehouses you can use to create external tables, and Monte Carlo integrates with all of them. The table below shows the supported combinations, and an example of the DDL involved follows it.
| File Storage Provider | Databricks | Snowflake External Tables | AWS Redshift Spectrum External Tables | Azure Synapse (Dedicated SQL Pool) External Tables | GCP BigQuery BigLake External Tables | GCP BigQuery Non-BigLake External Tables | AWS Glue and Athena |
|---|---|---|---|---|---|---|---|
| AWS S3 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
| Azure Blob Storage | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| Azure Data Lake Storage Gen2 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| Azure General-purpose v2 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| Google Cloud Storage | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
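Each warehouse has its own DDL for registering external tables. As a second hedged illustration, the sketch below exposes Parquet files in S3 through AWS Redshift Spectrum; the Glue database, IAM role ARN, bucket path, and column names are placeholders.

```sql
-- Illustrative sketch only: expose S3 Parquet files through Redshift Spectrum.
-- The Glue database, IAM role ARN, bucket path, and column names are placeholders.
CREATE EXTERNAL SCHEMA spectrum_raw
FROM DATA CATALOG
DATABASE 'raw_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_raw.events (
  event_id   VARCHAR(64),
  event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';
```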
## File Type Support
Each cloud warehouse supports different file formats when creating external tables. Refer to the table below to see which file formats are supported as an external table source for each warehouse.
File format support is determined by the warehouse provider (for example, Snowflake, Redshift, Databricks, or BigQuery). Monte Carlo monitors the data exposed through these external tables, but does not control which file formats are supported. If you need support for additional file types, contact your warehouse provider.
| File Type | Databricks | Snowflake | AWS Redshift | Azure Synapse (Dedicated SQL Pool) | GCP BigQuery BigLake | GCP BigQuery Non-BigLake | AWS Glue |
|---|---|---|---|---|---|---|---|
| Delta | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
| Iceberg | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
| CSV | ✅ | ✅ | ✅ | ✅ (Hadoop only) | ✅ | ✅ | ✅ |
| JSON | ✅ | ✅ | ✅ | ✅ (Hadoop only) | ✅ | ✅ | ✅ |
| Parquet | ✅ | ✅ | ✅ | ✅ (Hadoop and Native) | ✅ | ✅ | ✅ |
| Avro | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| ORC | ✅ | ✅ | ✅ | ✅ (Hadoop only) | ✅ | ✅ | ✅ |
| XML | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Ion | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| grokLog | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Hive RCFile (Record Columnar File) | ❌ | ❌ | ❌ | ✅ (Hadoop only) | ❌ | ❌ | ❌ |
| RCFile (Record Columnar File) | ❌ | ❌ | ❌ | ✅ (Hadoop only) | ❌ | ❌ | ❌ |
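In most warehouses, the file format is declared directly in the external table DDL. As a hedged example, a BigQuery BigLake external table over Parquet files in Google Cloud Storage might look like the following; the project, dataset, connection, and bucket names are placeholders.

```sql
-- Illustrative sketch only: BigLake external table over Parquet files in GCS.
-- The project, dataset, connection, and bucket names are placeholders.
CREATE EXTERNAL TABLE `example-project.raw.events_ext`
WITH CONNECTION `example-project.us.example_gcs_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bucket/events/*.parquet']
);
```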
## FAQs
**Does file storage monitoring mean Monte Carlo is connecting directly to the file storage?**
No. Monte Carlo does not connect directly to object storage. Instead, your raw files must be mapped to external tables in a cloud data warehouse. Monte Carlo connects to the warehouse and monitors those external tables, which represent the underlying storage files.
**What are external tables and why would I use them for monitoring?**
External tables let you query data stored in object storage (such as S3, ADLS, or GCS) without loading it into a warehouse. They provide a consistent, structured schema layer over raw files—making it possible to monitor schema changes, freshness, volume, and other data quality signals over time.
**Do I need to ingest or copy my data into the warehouse for monitoring?**
No. All monitoring is performed against metadata exposed through external tables. Your data remains in your storage system, and the warehouse simply provides a query interface.
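Because an external table is queryable like any other table, opt-in checks such as custom SQL monitors simply run SQL you define against it. As a hedged illustration (the table and column names are placeholders), a rule might look like:

```sql
-- Illustrative sketch only: a custom SQL check against an external table.
-- A non-zero count would indicate a data quality problem in the underlying files.
SELECT COUNT(*) AS missing_event_ids
FROM analytics.raw.events_ext
WHERE event_id IS NULL;
```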
**Do I need to already have one of these cloud warehouses deployed to get started?**
Not necessarily. If you don’t already use a cloud warehouse, you can set one up specifically for external table monitoring and scale it appropriately for that workload.
**Is there extra cost associated with monitoring external tables?**
No. In Monte Carlo, monitoring an external table is priced the same as monitoring a permanent table—there’s no separate external-table fee. However, your cloud warehouse may incur compute and query costs to create and query external tables. Refer to your warehouse vendor’s pricing documentation for details.
For help creating external tables using the different cloud warehouses, see the related guide below.
