Data Profiler

Data Profiler makes it easy to profile the contents of a table or view. This can be helpful when investigating a data quality issue reported by a business partner, when considering which monitors to create, or when simply getting familiar with the contents of a table.

The experience is interactive and no-code, making it approachable for less technical roles. Users can point and click to adjust the time range of data and filter down for particular segments.

Using Data Profiler

📘

Row Limit

Currently, Data Profiler only allows for 20 million rows to be actively profiled. If more than 20 million rows are in the set, you will be prompted to filter on a smaller amount of time (top right) or apply a custom WHERE clause.

Data Profiler is a tab within the Assets page for a table or view.

When a user loads the Data Profiler tab, it executes queries against the source warehouse to retrieve up to date statistics about the table. See the Architecture section to learn more about how the results of these queries are handled, to ensure data is not stored by Monte Carlo.

Adjust the slider in Row count and click on values in Segments to filter the rest of the data in the dashboard.

Data Profiler contains the following:

  • Filters: By default, a filter is applied for the trailing 7 days on a user-selected time field. The time filter can be changed using the time range selector on the left side. You can further filter the data by applying a custom WHERE clause.
  • Segments: Selecting a field to segment provides a distribution of the field data for the filtered range. Results are limited to the 50 most frequent values. This section is not shown if Data Sampling is disabled in your account.
  • Row count: Histogram of the count of rows, aggregated using a user-selected time field. At the bottom of this section is an easily adjustable slider to shorten or slide the desired time range.
  • Data profile: Common statistical metrics for each field, like the count and % of nulls, count and % uniques. Clicking on the fields will show more details in the Field Overview drawer. Users can also search for field names in the search bar provided in this widget.
  • Recommended monitors: We show monitor recommendations for top 15 fields based on alphabetic order or importance. We sample data of upto 10k rows to recommend monitors.

Field Overview

Users can get more details by clicking on the fields in Data Profile widget.

Field overview drawer

In the Field Overview drawer there are 3 widgets:

  • Overview: The overview table displays various metrics and different patterns detected in the data. For example If the field is a date, patterns detected can be % of dates in past 7 days, or in past 1 month, or % of weekdays etc.
  • Recommended monitors: We sample data of up to 10k rows and recommend monitors for the field selected for Field Overview.
  • Trend: Users can visualize day aggregated metrics like null %, null count, unique % and unique count over the time filter selected in the data profiler. We also show additional day aggregated metrics like mean, min, max and stddev for numeric and string fields.

Architecture and Permissions

🚧

Data Profiler Roles

Data Profiler is only available for users with roles of Account Owner, Domains Manager, and Editor.

The queries executed by the Data Profiler are created in the backend by Monte Carlo and dispatched to the customer's Monte Carlo Agent. The agent executes the query and returns the query response back to our backend which is then returned to the browser.

Care is taken to ensure that Monte Carlo does not run a large or costly query. By default, Monte Carlo will filter for the trailing week's worth of data using a user-selected time field. If Monte Carlo anticipates that this will query too much data (>20 million records), it pre-emptively suggests that the user select a narrower time window or apply a WHERE clause.

If Monte Carlo anticipates that a query will be too large, it will prompt the user to select a shorter time range or apply. WHERE clause.

If Monte Carlo anticipates that a query will be too large, it will prompt the user to select a shorter time range or apply. WHERE clause.