File Storage Monitoring

Overview

Modern data architectures rely heavily on object storage—such as Amazon S3, Azure Blob Storage, or Google Cloud Storage—to store raw, semi-structured, or historical data. These file-based data lakes serve as the foundation for analytics, machine learning, and operational reporting, providing a cost-effective and flexible storage layer for massive volumes of data. While object storage is ideal for durability and scale, it can be challenging to monitor: file-level changes, schema evolution, and data quality issues are often hidden until they impact downstream analytics.

To provide visibility into file-based data lakes, Monte Carlo can monitor external tables that represent files in object storage. By mapping raw files into structured external tables, Monte Carlo can track schema changes, volume, freshness, and quality metrics—giving teams actionable observability over their file storage.
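To make the mapping concrete, the sketch below shows what a minimal external table definition might look like using Snowflake syntax over an S3 prefix. The stage, storage integration, table, and column names are hypothetical placeholders, and the equivalent DDL differs for other warehouses.

```sql
-- Minimal sketch (Snowflake syntax). The stage, storage integration, table,
-- and column names are hypothetical placeholders for your own objects.
CREATE STAGE raw_events_stage
  URL = 's3://my-bucket/raw/events/'
  STORAGE_INTEGRATION = my_s3_integration;

CREATE EXTERNAL TABLE analytics.raw.events_ext (
  event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
  user_id  STRING    AS (VALUE:user_id::STRING)
)
LOCATION = @raw_events_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE;  -- keep file metadata in sync via cloud event notifications
```

Once a table like this exists in the warehouse, Monte Carlo can observe it the same way it observes any other table in that warehouse.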

When you’re ready to get started, refer to the examples and guides linked at the bottom of this page. You can also access them from the sidebar.

👍

Does Monte Carlo connect directly to my object storage?

Not directly. Monte Carlo monitors external tables in your cloud data warehouse. Those external tables are the structured “mapping” that points to files in object storage—so Monte Carlo can observe the data through the warehouse connection.

👍

What if I don’t use external tables or don’t have a cloud data warehouse yet?

That’s okay! This guide includes examples you can reference to set up external tables in supported warehouses. And if you’d like, you can create a warehouse solely for this use case and scale it appropriately.


Feature Support

| Category | Monitor / Lineage Capabilities | Support |
| --- | --- | --- |
| Table Monitor | Freshness (via opt-in volume monitor) | |
| Table Monitor | Volume (opt-in) | |
| Table Monitor | Schema Changes | |
| Table Monitor | JSON Schema Changes* | |
| Metric Monitor | Metric | |
| Metric Monitor | Comparison | |
| Validation Monitor | Custom SQL | |
| Validation Monitor | Validation | |
| Job Monitor | Query performance | |
| Lineage | Lineage | |

*JSON Schema monitors are only supported in our AWS Redshift, Snowflake and GCP BigQuery integrations.


File Storage and External Table Provider Support

Depending on the cloud file storage provider you use, there are several cloud warehouses that can create external tables over your files. Monte Carlo can integrate with all of the warehouses listed below.

| File Storage Provider | Databricks | Snowflake External Tables | AWS Redshift Spectrum External Tables | Azure Synapse (Dedicated SQL Pool) External Tables | GCP BigQuery BigLake External Tables | GCP BigQuery Non-BigLake External Tables | AWS Glue and Athena |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AWS S3 | | | | | | | |
| Azure Blob Storage | | | | | | | |
| Azure Data Lake Storage Gen2 | | | | | | | |
| Azure General-purpose v2 | | | | | | | |
| Google Cloud Storage | | | | | | | |

File Type Support

Each cloud warehouse supports different file formats when creating external tables. Refer to the table below to see which file formats are supported as an external table source for each warehouse.

File format support is determined by the warehouse provider (for example, Snowflake, Redshift, Databricks, or BigQuery). Monte Carlo monitors the data exposed through these external tables, but does not control which file formats are supported. If you need support for additional file types, contact your warehouse provider.

| File Type | Databricks | Snowflake | AWS Redshift | Azure Synapse (Dedicated SQL Pool) | GCP BigQuery BigLake | GCP BigQuery Non-BigLake | AWS Glue |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delta | | | | | | | |
| Iceberg | | | | | | | |
| CSV (Hadoop only) | | | | | | | |
| JSON (Hadoop only) | | | | | | | |
| Parquet (Hadoop and Native) | | | | | | | |
| Avro | | | | | | | |
| ORC (Hadoop only) | | | | | | | |
| XML | | | | | | | |
| Ion | | | | | | | |
| grokLog | | | | | | | |
| Hive RCFile (Record Columnar File) (Hadoop only) | | | | | | | |
| RCFile (Record Columnar File) (Hadoop only) | | | | | | | |
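As an illustration of how the file format is bound to the external table definition, the hypothetical sketch below uses Amazon Redshift Spectrum syntax to declare a Parquet-backed table. The external schema, table name, and S3 path are placeholders, it assumes an external schema has already been created over your data catalog, and the equivalent clause differs per warehouse (for example, FILE_FORMAT in Snowflake or the format option in BigQuery).

```sql
-- Minimal sketch (Redshift Spectrum syntax). The external schema, table, and
-- S3 location are hypothetical placeholders; the file format is fixed at
-- table creation time with STORED AS.
CREATE EXTERNAL TABLE spectrum_schema.orders_ext (
  order_id   BIGINT,
  order_date DATE,
  amount     DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://my-bucket/raw/orders/';
```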

FAQs

Does file storage monitoring mean Monte Carlo is connecting directly to the file storage?

No. Monte Carlo does not connect directly to object storage. Instead, your raw files must be mapped to external tables in a cloud data warehouse. Monte Carlo connects to the warehouse and monitors those external tables, which represent the underlying storage files.

What are external tables and why would I use them for monitoring?

External tables let you query data stored in object storage (such as S3, ADLS, or GCS) without loading it into a warehouse. They provide a consistent, structured schema layer over raw files—making it possible to monitor schema changes, freshness, volume, and other data quality signals over time.
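For instance, BigQuery lets you define an external table directly over files in GCS without any load step. The sketch below is hypothetical (the dataset, table, and bucket names are placeholders); for Parquet files, BigQuery infers the schema from the files themselves.

```sql
-- Minimal sketch (BigQuery syntax). Dataset, table, and bucket names are
-- hypothetical placeholders; the files stay in GCS and are read at query time.
CREATE EXTERNAL TABLE my_dataset.page_views_ext
OPTIONS (
  format = 'PARQUET',
  uris   = ['gs://my-bucket/page_views/*.parquet']
);
```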

Do I need to ingest or copy my data into the warehouse for monitoring?

No. All monitoring is performed against metadata exposed through external tables. Your data remains in your storage system, and the warehouse simply provides a query interface.
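To illustrate how the warehouse acts as that query interface, the hypothetical query below (reusing the orders_ext table sketched above) reads the external files in place; nothing is copied into warehouse storage. It is shown only as an example of what a warehouse can compute over an external table, not as the exact query Monte Carlo runs.

```sql
-- Hypothetical example: row-count and recency signals computed directly over
-- the external table; the underlying files never leave object storage.
SELECT COUNT(*)        AS row_count,
       MAX(order_date) AS latest_order_date
FROM spectrum_schema.orders_ext;
```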

Do I need to already have one of these cloud warehouses deployed to get started?

Not necessarily. If you don’t already use a cloud warehouse, you can set one up specifically for external table monitoring and scale it to fit that workload.

Is there extra cost associated with monitoring external tables?

No. In Monte Carlo, monitoring an external table is priced the same as monitoring a permanent table—there’s no separate external-table fee. However, your cloud warehouse may incur compute and query costs to create and query external tables. Refer to your warehouse vendor’s pricing documentation for details.


What’s Next

For help creating external tables using the different cloud warehouses, see the related guide below.