Data Explorer (beta)

Data Explorer makes it easy to profile the contents of a table or view. This can be helpful when investigating a data quality issue reported by a business partner, when considering which monitors to create, or when simply getting familiar with the contents of a table.

The experience is interactive and no-code, making it approachable for less technical roles. Users can point and click to adjust the time range of data and filter down for particular segments. In the future, we'd like to make it easy to compare multiple segments of data side-by-side.

Currently, Data Explorer is available for the Snowflake integration by default and is opt-in for Databricks and BigQuery. Your Monte Carlo Data Collector must be version 16624 or higher.

Using Data Explorer

📘

Beta Row Limit

In beta, Data Explorer can actively profile at most 20 million rows. If the selected set contains more than 20 million rows, you will be prompted to narrow the time range (top right) or apply a custom WHERE clause.

Data Explorer is a tab within the Assets page for a table or view.

When a user loads the Data Explorer tab, it executes queries against the source warehouse to retrieve up-to-date statistics about the table. See the Architecture and Permissions section to learn how the results of these queries are handled so that the data is not stored by Monte Carlo.

Adjust the slider in Row count and click on values in Segments to filter the rest of the data in the dashboard.


Data Explorer contains the following:

  • Filters: By default, a filter is applied for the trailing 7 days on a user-selected time field. The time filter can be changed using the time range selector at the top of the Assets page. You can further filter the data by applying a custom WHERE clause.
  • Segments: selecting a field to segment provides a distribution of the field data for the filtered range and allows you to further filter the Row count and Field profile. Results are limited to the 50 most frequent values. This section is not shown if Data Sampling is disabled in your account.
  • Row count: histogram of the count of rows, aggregated using a user-selected time field. At the bottom of this section is an easily adjustable slider to shorten or slide the desired time range.
  • Field profile: common statistical metrics for each field, like the count and % of nulls, count and % unique, minimum, maximum, mean, and standard deviation. Users can drill down into specific metrics to see how they trend over time.

The first three sections are intended to help users filter down to a specific segment or time range of data, so they can then review the Field profile for that segment.
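To make the Field profile metrics concrete, here is a minimal sketch of how they could be computed for a single field. It is not Monte Carlo's implementation: the function name `profile_field` is hypothetical, percentages are assumed to be relative to all rows (including nulls), and the standard deviation is assumed to be the population form.

```python
import statistics

def profile_field(values):
    """Compute profile metrics for one field's values (None stands for NULL).

    Assumption: percentages are relative to the total row count, and the
    standard deviation is the population standard deviation.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    unique_count = len(set(non_null))
    profile = {
        "null_count": null_count,
        "null_pct": 100.0 * null_count / total if total else 0.0,
        "unique_count": unique_count,
        "unique_pct": 100.0 * unique_count / total if total else 0.0,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "mean": None,
        "stddev": None,
    }
    # Mean and standard deviation only make sense for all-numeric fields.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(non_null):
        profile["mean"] = statistics.fmean(numeric)
        profile["stddev"] = statistics.pstdev(numeric)
    return profile

# Hypothetical sample: five rows of an "amount" field, one of them NULL.
print(profile_field([10, 20, None, 20, 30]))
```

In the product these statistics come from queries run against the warehouse; the sketch only illustrates what each metric means for the filtered rows.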

Metric Drill-down

Users can also drill down into a field profile metric to see how that metric has trended over time. This helps a user validate an issue reported by a business partner. For example, if someone reports a sudden spike of nulls in a key field, it's now much easier to validate that without ever writing SQL or setting up a monitor.

🚧

Time series data only

Drill-downs only work with tables that have a timestamp, as this is how the metric is plotted over time.

Drill down into a specific field metric to see how it has trended over time.
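As an illustration of why a timestamp is required, the sketch below buckets rows by day and computes the percentage of nulls per bucket, which is the kind of series a drill-down plots. The function `null_pct_by_day`, the field names, and the daily bucket size are assumptions made for this example, not details of the product.

```python
from collections import defaultdict
from datetime import datetime

def null_pct_by_day(rows, time_field, metric_field):
    """Return {date: % of rows where metric_field is NULL}, ordered by date."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        day = row[time_field].date()  # the timestamp decides the bucket
        totals[day] += 1
        if row[metric_field] is None:
            nulls[day] += 1
    return {day: 100.0 * nulls[day] / totals[day] for day in sorted(totals)}

# Hypothetical rows: nulls start appearing in "customer_id" on the second day.
rows = [
    {"created_at": datetime(2024, 1, 1, 9), "customer_id": 1},
    {"created_at": datetime(2024, 1, 1, 17), "customer_id": 2},
    {"created_at": datetime(2024, 1, 2, 9), "customer_id": None},
    {"created_at": datetime(2024, 1, 2, 17), "customer_id": 3},
]
for day, pct in null_pct_by_day(rows, "created_at", "customer_id").items():
    print(day, pct)
```

Without a time field there is nothing to bucket on, so the metric cannot be plotted as a trend.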


Architecture and Permissions

🚧

Data Explorer Roles

Data Explorer is only available to users with the Account Owner, Domains Manager, or Editor role.

The queries executed by Data Explorer are created in the backend by Monte Carlo and dispatched to the customer's Monte Carlo Agent. The agent executes each query and stores the result in the customer-specific object storage. It then returns a signed URL (with a 5-minute expiration time) to Monte Carlo, which passes the URL to the browser so the result can be downloaded.

Data queried by Data Explorer does not pass through the Monte Carlo Cloud Service. It is handled only by the Monte Carlo Agent and the user's browser. For more detail, see our Architecture & Deployment Options.

Care is taken to ensure that Monte Carlo does not run a large or costly query. By default, Monte Carlo will filter for the trailing week's worth of data using a user-selected time field. If Monte Carlo anticipates that this will query too much data (>20 million records), it pre-emptively suggests that the user select a narrower time window or apply a WHERE clause.

If Monte Carlo anticipates that a query will be too large, it will prompt the user to select a shorter time range or apply a WHERE clause.
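The guard described above can be sketched as a simple pre-flight check. This is an illustration only: how Monte Carlo estimates the row count is not documented here, so `estimated_rows` is taken as an input, and the names `ROW_LIMIT` and `check_query_size` are hypothetical.

```python
ROW_LIMIT = 20_000_000  # beta limit on actively profiled rows

def check_query_size(estimated_rows, row_limit=ROW_LIMIT):
    """Return (ok, message); message is None when the query may run."""
    if estimated_rows > row_limit:
        return (
            False,
            f"About {estimated_rows:,} rows exceed the {row_limit:,} row limit. "
            "Select a shorter time range or apply a WHERE clause.",
        )
    return (True, None)

# Hypothetical estimates for a trailing 7-day window on two tables.
print(check_query_size(3_500_000))
print(check_query_size(45_000_000))
```

The check runs before any profiling query is dispatched, which is what keeps an oversized query from reaching the warehouse.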


Running the Data Explorer using customer-hosted Object Storage

If you host the object storage yourself, make sure it allows CORS requests from the browser so that the Data Explorer UI can fetch the query results.

For S3, the following CORS access policy will allow us to get the data exported by the Data Explorer:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "https://*.getmontecarlo.com"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 3000
    }
]

See https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html for more information.
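Before applying the policy, it can help to sanity-check which browser requests it would admit. The sketch below is a local approximation of the origin and method match, not S3's actual evaluation logic; it assumes at most one `*` wildcard per allowed origin, and the host `example.getmontecarlo.com` is a hypothetical placeholder for your Monte Carlo UI origin.

```python
import json

# The CORS policy from above, as JSON.
CORS_RULES = json.loads("""
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET"],
        "AllowedOrigins": ["https://*.getmontecarlo.com"],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 3000
    }
]
""")

def origin_matches(pattern, origin):
    """Match an origin against a pattern containing at most one '*'."""
    if "*" not in pattern:
        return pattern == origin
    prefix, suffix = pattern.split("*", 1)
    return (
        len(origin) >= len(prefix) + len(suffix)
        and origin.startswith(prefix)
        and origin.endswith(suffix)
    )

def is_allowed(rules, origin, method):
    """True if any rule allows this method from this origin."""
    return any(
        method in rule["AllowedMethods"]
        and any(origin_matches(p, origin) for p in rule["AllowedOrigins"])
        for rule in rules
    )

print(is_allowed(CORS_RULES, "https://example.getmontecarlo.com", "GET"))
print(is_allowed(CORS_RULES, "https://example.getmontecarlo.com", "PUT"))
print(is_allowed(CORS_RULES, "https://example.com", "GET"))
```

Only GET requests from a getmontecarlo.com subdomain pass, which matches what the Data Explorer UI needs in order to download results through the signed URL.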