Object Storage: S3, GCS and ABS

Overview

Modern data architectures rely heavily on object storage—such as Amazon S3, Azure Blob Storage, or Google Cloud Storage—to store raw, semi-structured, or historical data. These file-based data lakes serve as the foundation for analytics, machine learning, and operational reporting, providing a cost-effective and flexible storage layer for massive volumes of data. While object storage is ideal for durability and scale, it can be challenging to monitor: file-level changes, schema evolution, and data quality issues are often hidden until they impact downstream analytics.

To provide visibility into file-based data lakes, Monte Carlo can monitor external tables that represent files in object storage. By mapping raw files into structured external tables, Monte Carlo can track schema changes, volume, freshness, and quality metrics—giving teams actionable observability over their file storage.
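As a rough illustration of what tracking schema changes over an external table involves, the sketch below compares two snapshots of a table's columns. It is not Monte Carlo's implementation; the function name `diff_schema`, the column names, and the type strings are all hypothetical.

```python
from typing import Dict, List


def diff_schema(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column_name: type} snapshots of the same table.

    Hypothetical helper for illustration only.
    """
    shared = set(previous) & set(current)
    return {
        # Columns present now but not in the earlier snapshot.
        "added": sorted(set(current) - set(previous)),
        # Columns that disappeared since the earlier snapshot.
        "removed": sorted(set(previous) - set(current)),
        # Columns present in both snapshots whose declared type differs.
        "type_changed": sorted(c for c in shared if previous[c] != current[c]),
    }


# Hypothetical snapshots of one external table on two consecutive days.
yesterday = {"event_id": "STRING", "amount": "INT64", "created_at": "TIMESTAMP"}
today = {"event_id": "STRING", "amount": "FLOAT64", "country": "STRING"}

changes = diff_schema(yesterday, today)
print(changes)
```

Because the external table gives the files a declared schema, a comparison like this can run on table metadata alone, without reading the underlying files directly.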

When you’re ready to get started, refer to the examples and guides linked at the bottom of this page. You can also access them from the sidebar.


Does Monte Carlo require direct connectivity to object storage?

No. Monte Carlo monitors files through external tables in your cloud data warehouse. Those external tables are the structured “mapping” that points to files in object storage. This eliminates the need to maintain metadata about files on object storage in multiple systems and ensures consistency between your production pipelines and your data observability checks.


What if I don’t use external tables for object storage files?

That’s okay! This guide includes examples you can reference to set up external tables in various warehouse and lake environments. If needed, you can create a cost-efficient lake dedicated to this purpose and scale it appropriately.


Feature Support

Category | Capability | Support
Table Monitor | Freshness (via opt-in volume monitor) |
Table Monitor | Volume (opt-in) |
Table Monitor | Schema Changes |
Table Monitor | JSON Schema Changes* |
Metric Monitor | Metric |
Metric Monitor | Comparison |
Validation Monitor | Custom SQL |
Validation Monitor | Validation |
Job Monitor | Query performance |
Lineage | Lineage |

*JSON Schema monitors are only supported in our AWS Redshift, Snowflake and GCP BigQuery integrations.


Object Storage Support

Depending on your cloud object storage provider, you can use one of several cloud warehouses or lakes to monitor your files. Monte Carlo integrates with all of the warehouses and lakes listed below.

Databricks | Snowflake External Tables | AWS Redshift Spectrum External Tables | Azure Synapse (Dedicated SQL Pool) External Tables | GCP BigQuery BigLake External Tables | GCP BigQuery Non-BigLake External Tables | AWS Glue and Athena
AWS S3
Azure Blob Storage
Azure Data Lake Storage Gen2
Azure General-purpose v2
Google Cloud Storage

File Type Support

Each cloud warehouse supports different file formats when creating external tables. Refer to the table below to see which file formats are supported as an external table source for each warehouse.

File format support is determined by the warehouse provider (for example, Snowflake, Redshift, Databricks, or BigQuery). Monte Carlo monitors the data exposed through these external tables, but does not control which file formats are supported. If you need support for additional file types, contact your warehouse provider.

File Type | Databricks | Snowflake | AWS Redshift | Azure Synapse (Dedicated SQL Pool) | GCP BigQuery BigLake | GCP BigQuery Non-BigLake | AWS Glue
Delta
Iceberg
CSV (Hadoop only)
JSON (Hadoop only)
Parquet (Hadoop and Native)
Avro
ORC (Hadoop only)
XML
Ion
grokLog
Hive RCFile (Record Columnar File) (Hadoop only)
RCFile (Record Columnar File) (Hadoop only)

FAQs

Does object storage monitoring mean Monte Carlo is connecting directly to the file storage?

No. Monte Carlo monitors files through external tables in your cloud data warehouse. Those external tables are the structured “mapping” that points to files in object storage. This eliminates the need to maintain metadata about files on object storage in multiple systems and ensures consistency between your production pipelines and your data observability checks.

What are external tables and why would I use them for monitoring?

External tables let you query data stored in object storage (such as S3, ADLS, or GCS) without loading it into a warehouse. They provide a consistent, structured schema layer over raw files—making it possible to monitor schema changes, freshness, volume, and other data quality signals over time. They represent a convenient way to store and manage metadata about where particular tables are located, what formats are used, what schema they should have, etc.
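To make the idea concrete, here is a minimal sketch that renders a BigQuery-style `CREATE EXTERNAL TABLE` statement over Parquet files in Google Cloud Storage. The helper `external_table_ddl`, the table name, and the bucket path are hypothetical, and the exact DDL differs for each warehouse, so treat this as an illustration and check your warehouse's documentation for the real syntax and options.

```python
from typing import Sequence


def external_table_ddl(table: str, file_format: str, uris: Sequence[str]) -> str:
    """Render a BigQuery-style CREATE EXTERNAL TABLE statement.

    Hypothetical helper: inputs are assumed to be trusted values,
    so no escaping or validation of identifiers is attempted here.
    """
    if not uris:
        raise ValueError("at least one source URI is required")
    # Quote each URI and join them into the array literal used by OPTIONS.
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (\n"
        f"  format = '{file_format.upper()}',\n"
        f"  uris = [{uri_list}]\n"
        f");"
    )


# Hypothetical dataset, table, and bucket names.
ddl = external_table_ddl(
    table="analytics.raw_events",
    file_format="parquet",
    uris=["gs://example-bucket/events/*.parquet"],
)
print(ddl)
```

The statement only registers where the files live and how to read them. The files themselves stay in object storage, and the warehouse reads them at query time.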

Do I need to ingest or copy my data into a warehouse to allow monitoring?

No. All monitoring is performed against the data and metadata exposed through external tables. Your data remains in your object storage, and the warehouse simply provides a query interface.

Do I need to already have one of these cloud warehouses deployed to get started?

Not necessarily. If you don’t already use a cloud warehouse, you can set one up specifically for external table monitoring and scale it appropriately.

Is there extra cost associated with monitoring external tables?

No. In Monte Carlo, monitoring an external table is priced the same as monitoring a permanent table—there’s no separate external-table fee. However, your cloud warehouse may incur compute and query costs to create and query external tables. Refer to your warehouse vendor’s pricing documentation for details.


What’s Next

For detailed instructions on how to monitor files on object storage using various warehouse or lake technologies, follow the guides below.