Data Profiler

Data Profiler makes it easy to profile the contents of a table or view. This can be helpful when investigating a data quality issue reported by a business partner, when considering which monitors to create, or when simply getting familiar with the contents of a table.

The experience is interactive and no-code, making it approachable for less technical roles. Users can point and click to adjust the time range of the data and filter down to particular segments.

Using Data Profiler

📘

Row Limit

Currently, Data Profiler can actively profile at most 20 million rows. If the set contains more than 20 million rows, you will be prompted to narrow the time range (top right) or apply a custom WHERE clause.

Data Profiler is a tab within the Assets page for a table or view.

When a user loads the Data Profiler tab, it executes queries against the source warehouse to retrieve up-to-date statistics about the table. See the Architecture and Permissions section to learn how the results of these queries are handled so that data is not stored by Monte Carlo.

Adjust the slider in Row count and click on values in Segments to filter the rest of the data in the dashboard.

Data Profiler contains the following:

  • Filters: By default, a filter is applied for the trailing 7 days on a user-selected time field. The time filter can be changed using the time range selector on the left side. You can further filter the data by applying a custom WHERE clause.
  • Segments: Selecting a field to segment provides a distribution of the field data for the filtered range. Results are limited to the 50 most frequent values. This section is not shown if Data Sampling is disabled in your account.
  • Row count: Histogram of the count of rows, aggregated using a user-selected time field. At the bottom of this section is an easily adjustable slider to shorten or shift the desired time range.
  • Data profile: Common statistical metrics for each field, like the count and % of nulls and the count and % of unique values (see the sketch after this list). Clicking on a field shows more details in the Field Overview drawer. Users can also search for field names in the search bar provided in this widget.
  • Recommended monitors: Monitor recommendations are shown for the top 15 fields, ordered by importance or alphabetically. Recommendations are based on a sample of up to 10k rows.
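To make the Data profile metrics concrete, here is a rough sketch of the kind of per-field statistics the widget reports, computed with pandas over a small hypothetical DataFrame. The column names and data are illustrative only; in practice these statistics come from queries executed against the source warehouse, not from pandas.

```python
import pandas as pd

# Hypothetical sample of a profiled table (illustration only).
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, None],
    "status": ["open", "open", "closed", None, "closed"],
})

# Per-field statistics of the kind shown in the Data profile widget.
profile = pd.DataFrame({
    "null_count": df.isna().sum(),
    "null_pct": df.isna().mean() * 100,
    "unique_count": df.nunique(),          # excludes nulls by default
    "unique_pct": df.nunique() / len(df) * 100,
})
print(profile)
```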

Field overview

Users can get more details by clicking on a field in the Data profile widget.

Field overview drawer

In the Field Overview drawer there are 3 widgets:

  • Overview: The overview table displays various metrics and different patterns detected in the data. For example, if the field is a date, detected patterns can include the % of dates in the past 7 days or the past month, or the % of weekdays.
  • Recommended monitors: We sample data of up to 10k rows and recommend monitors for the field selected for Field Overview.
  • Trend: Users can visualize day-aggregated metrics like null %, null count, unique %, and unique count over the time filter selected in Data Profiler. Additional day-aggregated metrics like mean, min, max, and stddev are shown for numeric and string fields. A sketch of these day-aggregated metrics follows this list.
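As a rough illustration of the Trend widget's day-aggregated metrics, the sketch below computes them with pandas over a hypothetical events table. The field names and data are assumptions for illustration, not Monte Carlo's implementation.

```python
import pandas as pd

# Hypothetical table with a user-selected time field ("created_at")
# and a numeric field ("amount") to trend.
events = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2024-05-01 09:00", "2024-05-01 17:30", "2024-05-02 08:15"]
    ),
    "amount": [10.0, None, 25.0],
})

# Aggregate the metrics per day, like the Trend widget's daily series.
daily = events.groupby(events["created_at"].dt.date)["amount"].agg(
    null_pct=lambda s: s.isna().mean() * 100,
    unique_count="nunique",
    mean="mean",
    min="min",
    max="max",
    stddev="std",
)
print(daily)
```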

GenAI monitor recommendations

  1. GenAI monitor recommendations use a data sample of up to 3K rows and 80 columns.
  2. No data is stored in MC at any point.
  3. No data leaves the MC environment at any point.
  4. No data is used to train models for other customers.

Multi-column validations:

Multi-column validation monitor recommendations are based on the following data:

  1. Sample data from the table
  2. Read and write queries running on the table
  3. Field descriptions, labels, and metadata, if they exist

Recommended monitors are validated on the data before being presented in order to avoid hallucinations.

Types of validations the model currently supports (a sketch of these checks follows the list):

  1. Categorical vs. categorical → Field_1 = X and Field_2 =/≠ Y (example: if company == MC then employee_quality == High)
  2. XOR filling/nulling → Field_1 is null/empty and Field_2 is not null/empty, or vice versa (if manual_threshold is null then auto_threshold is not null, and vice versa)
  3. Boolean value vs. null/empty → Field_1 is True/False and Field_2 =/≠ null/empty/etc. (example: detecting is True and threshold is not null)
  4. Numeric vs. numeric → Field_1 < Field_2 (sometimes also 3 fields: company_value = stock_price * num_stocks)
  5. Time vs. time → Field_1 ≥ Field_2 (sometimes also 3 fields: start_time < response_time < end_time)
  6. Is in set → Field_1 is not in [List_of_possible_values]
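Each of these validation types reduces to a simple row-level predicate. Below is a minimal sketch in Python with entirely hypothetical field names and values; Monte Carlo's actual validation logic is not published.

```python
from datetime import datetime

# Hypothetical row; all field names are illustrative only.
row = {
    "company": "MC",
    "employee_quality": "High",
    "manual_threshold": None,
    "auto_threshold": 0.8,
    "detecting": True,
    "threshold": 0.5,
    "start_time": datetime(2024, 5, 1, 9, 0),
    "end_time": datetime(2024, 5, 1, 10, 0),
    "status": "open",
}

checks = {
    # 1. Categorical vs. categorical: if company == MC then quality == High
    "quality_rule": row["company"] != "MC" or row["employee_quality"] == "High",
    # 2. XOR filling/nulling: exactly one of the two thresholds is set
    "xor_thresholds": (row["manual_threshold"] is None)
                      != (row["auto_threshold"] is None),
    # 3. Boolean vs. null: detecting is True implies threshold is not null
    "detecting_rule": not row["detecting"] or row["threshold"] is not None,
    # 4/5. Numeric or time ordering across fields
    "time_order": row["start_time"] <= row["end_time"],
    # 6. Is in set
    "status_in_set": row["status"] in {"open", "closed"},
}
assert all(checks.values()), checks
```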

Segmented metrics:

Segmented metric monitor recommendations are based on the following data:

  1. Sample data from the table
  2. Read and write queries running on the table
  3. Field descriptions, labels, and metadata, if they exist

Field descriptions, topic, and types of usage in queries are then used together with a few other metrics gathered from the data sample to create a scoring formula. The field's value cardinality is also used in ranking the recommendations. The top-scoring field is then given as the suggested segmentation field for volume and freshness monitors.
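Monte Carlo does not publish the exact scoring formula; the sketch below only illustrates the general shape described above, with made-up weights and signal names.

```python
# Illustrative only: the real scoring formula and its weights are internal.
# Each candidate field is scored from signals like description/topic
# relevance, how often it appears in query predicates, and cardinality.
def segment_score(field: dict) -> float:
    score = (
        2.0 * field["description_relevance"]   # from field docs/labels
        + 1.5 * field["query_usage"]           # filter/group-by frequency
    )
    # Penalize cardinality outside a useful segmentation range: a good
    # segmentation field has a small, stable set of values.
    if not (2 <= field["cardinality"] <= 50):
        score *= 0.25
    return score

candidates = [
    {"name": "region", "description_relevance": 0.9,
     "query_usage": 0.7, "cardinality": 8},
    {"name": "user_id", "description_relevance": 0.2,
     "query_usage": 0.9, "cardinality": 100_000},
]
best = max(candidates, key=segment_score)
print(best["name"])  # -> "region"
```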

Regex:

These recommendations are created using only a sample of the data. The sample is passed to an agent loop that moves through the following stages:

  1. GenAI creates a regex suggestion based on a sample of the data.
  2. Validate the regex's accuracy against all of the data to confirm it matches every valid value, and run it against a "bad" dataset created by mutating the original strings in multiple ways, to confirm the regex is not too general (e.g., `.*`).
  3. Pass examples of good-and-caught, good-and-not-caught, bad-and-caught, and bad-and-not-caught values to a reflection agent, which explains where the regex is too specific or too general and what should be done to fix these issues.
  4. Pass the reflection and the examples, together with all previously tested regexes, to a fixer agent to create a new, better version.
  5. Validate again (step 2). Once validation passes, matching all good examples and no bad examples, the result is returned. A sketch of this loop follows the list.
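A minimal sketch of this loop is below. The `suggest`, `reflect`, and `fix` callables stand in for the GenAI calls in steps 1, 3, and 4; they, and all other names here, are hypothetical placeholders, not a real Monte Carlo API.

```python
import re

def validate(pattern: str, good: list[str], bad: list[str]) -> dict:
    """Step 2: bucket examples by whether the regex catches them."""
    rx = re.compile(pattern)
    return {
        "good_caught": [s for s in good if rx.fullmatch(s)],
        "good_missed": [s for s in good if not rx.fullmatch(s)],
        "bad_caught": [s for s in bad if rx.fullmatch(s)],
    }

def regex_agent_loop(good, bad, suggest, reflect, fix, max_rounds=5):
    tried = []
    pattern = suggest(good)                    # step 1: GenAI suggestion
    for _ in range(max_rounds):
        report = validate(pattern, good, bad)  # step 2: check coverage
        if not report["good_missed"] and not report["bad_caught"]:
            return pattern                     # step 5: validation passed
        critique = reflect(pattern, report)    # step 3: reflection agent
        tried.append(pattern)
        pattern = fix(critique, report, tried) # step 4: fixer agent
    return None                                # never converged
```

Capping the number of rounds means a pattern that never converges fails closed instead of looping forever; whether the real loop does this is an assumption.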

Note that these suggestions are only available in the Field Overview drawer.

Architecture and Permissions

🚧

Data Profiler Roles

Data Profiler is only available to users with the Account Owner, Domains Manager, or Editor role.

The queries executed by Data Profiler are created in the backend by Monte Carlo and dispatched to the customer's Monte Carlo Agent. The agent executes the query and returns the response to our backend, which then returns it to the browser.

Care is taken to ensure that Monte Carlo does not run overly large or costly queries. By default, Monte Carlo filters for the trailing week's worth of data using a user-selected time field. If Monte Carlo anticipates that this will query too much data (more than 20 million records), it preemptively suggests that the user select a narrower time window or apply a WHERE clause.

If Monte Carlo anticipates that a query will be too large, it will prompt the user to select a shorter time range or apply a WHERE clause.
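As a rough illustration of this guardrail, the sketch below shows what such a pre-flight row-count check could look like. The `execute_query` callable, the query shape, and the error handling are all assumptions for illustration, not Monte Carlo's implementation.

```python
ROW_LIMIT = 20_000_000  # profiling limit described above

def preflight(execute_query, table, time_field, start, end, where=None):
    """Estimate how many rows the profile would scan before running it."""
    sql = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {time_field} BETWEEN '{start}' AND '{end}'"
    )
    if where:
        sql += f" AND ({where})"
    # Hypothetical helper: dispatches the query via the customer's agent
    # and returns a single-value row.
    (count,) = execute_query(sql)
    if count > ROW_LIMIT:
        raise ValueError(
            f"{count:,} rows exceed the {ROW_LIMIT:,} limit; "
            "narrow the time range or add a WHERE clause"
        )
    return count
```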
