Data Profiler

Data Profiler makes it easy to profile the contents of a table or view. This can be helpful when investigating a data quality issue reported by a business partner, when considering which monitors to create, or when simply getting familiar with the contents of a table.

The experience is interactive and no-code, making it approachable for less technical roles. Users can point and click to adjust the time range of the data and filter down to particular segments.

Using Data Profiler

📘

Row Limit

Currently, Data Profiler can actively profile at most 20 million rows. If the set contains more than 20 million rows, you will be prompted to narrow the time range (top right) or apply a custom WHERE clause.

Data Profiler is a tab within the Assets page for a table or view.

When a user loads the Data Profiler tab, it executes queries against the source warehouse to retrieve up-to-date statistics about the table. See the Architecture and Permissions section to learn how the results of these queries are handled so that data is not stored by Monte Carlo.

Adjust the slider in Row count and click on values in Segments to filter the rest of the data in the dashboard.

Data Profiler contains the following:

  • Filters: By default, a filter is applied for the trailing 7 days on a user-selected time field. The time filter can be changed using the time range selector on the left side. You can further filter the data by applying a custom WHERE clause.
  • Segments: Selecting a field to segment provides a distribution of the field data for the filtered range. Results are limited to the 50 most frequent values. This section is not shown if Data Sampling is disabled in your account.
  • Row count: Histogram of the count of rows, aggregated using a user-selected time field. At the bottom of this section is an easily adjustable slider to shorten or shift the desired time range.
  • Data profile: Common statistical metrics for each field, like the count and % of nulls and the count and % of unique values (see the sketch after this list). Clicking on a field shows more details in the Field Overview drawer. Users can also search for field names in the search bar provided in this widget.
  • Recommended monitors: Monitor recommendations are shown for the top 15 fields, ordered by importance or alphabetically. Recommendations are based on a sample of up to 10k rows.
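To make the Data profile metrics concrete, here is a rough sketch of the kind of per-field statistics the widget reports, computed with pandas over a small hypothetical DataFrame. The column names and data are illustrative only; in practice these statistics come from queries executed against the source warehouse, not from pandas.

```python
import pandas as pd

# Hypothetical sample of a profiled table (illustration only).
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, None],
    "status": ["open", "open", "closed", None, "closed"],
})

# Per-field statistics of the kind shown in the Data profile widget.
profile = pd.DataFrame({
    "null_count": df.isna().sum(),
    "null_pct": df.isna().mean() * 100,
    "unique_count": df.nunique(),          # excludes nulls by default
    "unique_pct": df.nunique() / len(df) * 100,
})
print(profile)
```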

Field overview

Users can get more details by clicking on a field in the Data profile widget.

Field overview drawer

In the Field Overview drawer there are 3 widgets:

  • Overview: The overview table displays various metrics and different patterns detected in the data. For example, if the field is a date, detected patterns can include the % of dates in the past 7 days or the past month, or the % of weekdays.
  • Recommended monitors: We sample data of up to 10k rows and recommend monitors for the field selected for Field Overview.
  • Trend: Users can visualize day-aggregated metrics like null %, null count, unique %, and unique count over the time filter selected in Data Profiler. Additional day-aggregated metrics like mean, min, max, and stddev are shown for numeric and string fields. A sketch of these day-aggregated metrics follows this list.
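As a rough illustration of the Trend widget's day-aggregated metrics, the sketch below computes them with pandas over a hypothetical events table. The field names and data are assumptions for illustration, not Monte Carlo's implementation.

```python
import pandas as pd

# Hypothetical table with a user-selected time field ("created_at")
# and a numeric field ("amount") to trend.
events = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2024-05-01 09:00", "2024-05-01 17:30", "2024-05-02 08:15"]
    ),
    "amount": [10.0, None, 25.0],
})

# Aggregate the metrics per day, like the Trend widget's daily series.
daily = events.groupby(events["created_at"].dt.date)["amount"].agg(
    null_pct=lambda s: s.isna().mean() * 100,
    unique_count="nunique",
    mean="mean",
    min="min",
    max="max",
    stddev="std",
)
print(daily)
```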

GenAI monitor recommendations

  1. GenAI monitor recommendations use a data sample of up to 3K rows and 80 columns.
  2. No data is stored in MC at any point.
  3. No data leaves the MC environment at any point.
  4. No data is used to train models for other customers.

Multi-column validations:

Multi-column validation monitor recommendations are based on the following data:

  1. Sample data from the table
  2. Read and write queries running on the table
  3. Field descriptions, labels, and metadata, if they exist

Recommended monitors are validated on the data before being presented in order to avoid hallucinations.

Types of validations the model currently supports (a sketch of these checks follows the list):

  1. Categorical vs. categorical → Field_1 = X and Field_2 =/≠ Y (example: if company == MC then employee_quality == High)
  2. XOR filling/nulling → Field_1 is null/empty and Field_2 is not null/empty, or vice versa (if manual_threshold is null then auto_threshold is not null, and vice versa)
  3. Boolean value vs. null/empty → Field_1 is True/False and Field_2 =/≠ null/empty/etc. (example: detecting is True and threshold is not null)
  4. Numeric vs. numeric → Field_1 < Field_2 (sometimes also 3 fields: company_value = stock_price * num_stocks)
  5. Time vs. time → Field_1 ≥ Field_2 (sometimes also 3 fields: start_time < response_time < end_time)
  6. Is in set → Field_1 is not in [List_of_possible_values]
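Each of these validation types reduces to a simple row-level predicate. Below is a minimal sketch in Python with entirely hypothetical field names and values; Monte Carlo's actual validation logic is not published.

```python
from datetime import datetime

# Hypothetical row; all field names are illustrative only.
row = {
    "company": "MC",
    "employee_quality": "High",
    "manual_threshold": None,
    "auto_threshold": 0.8,
    "detecting": True,
    "threshold": 0.5,
    "start_time": datetime(2024, 5, 1, 9, 0),
    "end_time": datetime(2024, 5, 1, 10, 0),
    "status": "open",
}

checks = {
    # 1. Categorical vs. categorical: if company == MC then quality == High
    "quality_rule": row["company"] != "MC" or row["employee_quality"] == "High",
    # 2. XOR filling/nulling: exactly one of the two thresholds is set
    "xor_thresholds": (row["manual_threshold"] is None)
                      != (row["auto_threshold"] is None),
    # 3. Boolean vs. null: detecting is True implies threshold is not null
    "detecting_rule": not row["detecting"] or row["threshold"] is not None,
    # 4/5. Numeric or time ordering across fields
    "time_order": row["start_time"] <= row["end_time"],
    # 6. Is in set
    "status_in_set": row["status"] in {"open", "closed"},
}
assert all(checks.values()), checks
```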

Segmented metrics:

Segmented metric monitor recommendations are based on the following data:

  1. Sample data from the table
  2. Read and write queries running on the table
  3. Field descriptions, labels, and metadata, if they exist

Field descriptions, topic, and types of usage in queries are then used together with a few other metrics gathered from the data sample to create a scoring formula. The field's value cardinality is also used in ranking the recommendations. The top-scoring field is then given as the suggested segmentation field for volume and freshness monitors.
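Monte Carlo does not publish the exact scoring formula; the sketch below only illustrates the general shape described above, with made-up weights and signal names.

```python
# Illustrative only: the real scoring formula and its weights are internal.
# Each candidate field is scored from signals like description/topic
# relevance, how often it appears in query predicates, and cardinality.
def segment_score(field: dict) -> float:
    score = (
        2.0 * field["description_relevance"]   # from field docs/labels
        + 1.5 * field["query_usage"]           # filter/group-by frequency
    )
    # Penalize cardinality outside a useful segmentation range: a good
    # segmentation field has a small, stable set of values.
    if not (2 <= field["cardinality"] <= 50):
        score *= 0.25
    return score

candidates = [
    {"name": "region", "description_relevance": 0.9,
     "query_usage": 0.7, "cardinality": 8},
    {"name": "user_id", "description_relevance": 0.2,
     "query_usage": 0.9, "cardinality": 100_000},
]
best = max(candidates, key=segment_score)
print(best["name"])  # -> "region"
```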

Regex:

These recommendations are created using only a sample of the data. The sample is passed to an agent loop that moves through the following stages:

  1. GenAI creates a regex suggestion based on a sample of the data.
  2. Validate the regex's accuracy against all of the data to confirm it matches every valid value, and run it against a "bad" dataset created by mutating the original strings in multiple ways, to confirm the regex is not too general (e.g., `.*`).
  3. Pass examples of good-and-caught, good-and-not-caught, bad-and-caught, and bad-and-not-caught values to a reflection agent, which explains where the regex is too specific or too general and what should be done to fix these issues.
  4. Pass the reflection and the examples, together with all previously tested regexes, to a fixer agent to create a new, better version.
  5. Validate again (step 2). Once validation passes, matching all good examples and no bad examples, the result is returned. A sketch of this loop follows the list.
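A minimal sketch of this loop is below. The `suggest`, `reflect`, and `fix` callables stand in for the GenAI calls in steps 1, 3, and 4; they, and all other names here, are hypothetical placeholders, not a real Monte Carlo API.

```python
import re

def validate(pattern: str, good: list[str], bad: list[str]) -> dict:
    """Step 2: bucket examples by whether the regex catches them."""
    rx = re.compile(pattern)
    return {
        "good_caught": [s for s in good if rx.fullmatch(s)],
        "good_missed": [s for s in good if not rx.fullmatch(s)],
        "bad_caught": [s for s in bad if rx.fullmatch(s)],
    }

def regex_agent_loop(good, bad, suggest, reflect, fix, max_rounds=5):
    tried = []
    pattern = suggest(good)                    # step 1: GenAI suggestion
    for _ in range(max_rounds):
        report = validate(pattern, good, bad)  # step 2: check coverage
        if not report["good_missed"] and not report["bad_caught"]:
            return pattern                     # step 5: validation passed
        critique = reflect(pattern, report)    # step 3: reflection agent
        tried.append(pattern)
        pattern = fix(critique, report, tried) # step 4: fixer agent
    return None                                # never converged
```

Capping the number of rounds means a pattern that never converges fails closed instead of looping forever; whether the real loop does this is an assumption.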

Note that these suggestions are only available in the Field Overview drawer.

Architecture and Permissions

🚧

Data Profiler Roles

Data Profiler is only available to users with the Account Owner, Domains Manager, or Editor role.

The queries executed by Data Profiler are created in the backend by Monte Carlo and dispatched to the customer's Monte Carlo Agent. The agent executes the query and returns the response to our backend, which then returns it to the browser.

Care is taken to ensure that Monte Carlo does not run overly large or costly queries. By default, Monte Carlo filters for the trailing week's worth of data using a user-selected time field. If Monte Carlo anticipates that this will query too much data (more than 20 million records), it preemptively suggests that the user select a narrower time window or apply a WHERE clause.

If Monte Carlo anticipates that a query will be too large, it will prompt the user to select a shorter time range or apply a WHERE clause.
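As a rough illustration of this guardrail, the sketch below shows what such a pre-flight row-count check could look like. The `execute_query` callable, the query shape, and the error handling are all assumptions for illustration, not Monte Carlo's implementation.

```python
ROW_LIMIT = 20_000_000  # profiling limit described above

def preflight(execute_query, table, time_field, start, end, where=None):
    """Estimate how many rows the profile would scan before running it."""
    sql = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {time_field} BETWEEN '{start}' AND '{end}'"
    )
    if where:
        sql += f" AND ({where})"
    # Hypothetical helper: dispatches the query via the customer's agent
    # and returns a single-value row.
    (count,) = execute_query(sql)
    if count > ROW_LIMIT:
        raise ValueError(
            f"{count:,} rows exceed the {ROW_LIMIT:,} limit; "
            "narrow the time range or add a WHERE clause"
        )
    return count
```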
