Custom Monitors - SQL Rules

This guide explains how to create SQL Rules in Monte Carlo.

Introduction to SQL rules

With this advanced feature, you can write custom SQL to check for anomalies specific to your business. Given the flexibility of SQL, Monte Carlo customers often use these rules to get notified when thresholds for null values, referential integrity, empty strings, etc. are breached.

A list of common use cases and associated SQL templates can be found below .

To view current SQL Rules and create new rules, visit the Monitors tab in the Monte Carlo app.

When creating a new SQL rule, keep the following in mind:

  • The SQL query must take less than 15 minutes to run. We recommend filtering on partitions and indexes to minimize query time.
  • When testing the SQL query from the app, it must take less than 30 seconds to run. You can add a limit to reduce query time, or test the query in your warehouse.
  • The SQL query must be a valid expression that can be run as a subquery. Monte Carlo will wrap your expression in a SELECT COUNT so it only fetches the number of rows returned. This means rules should be written such that the query returns the anomalous rows.

🚧

Samplings

To better help users find root cause for when these rules breach, we now also store a sample of the breached output in your data collector. See additional information here: Samplings.

Setting up a SQL Rule

To create a rule in Monte Carlo, please visit the monitors page, select CREATE NEW MONITOR:

CONFIGURE MONITOR under "SQL rule":

and follow the wizard:

Common SQL rules

The following patterns cover some common scenarios you can address with SQL rules.

📘

How should I use the templates in this article?

  1. Replace any {{table}} and {{column}} placeholders with the table and column names you want to test for. {{value}} placeholders should be replaced with thresholds that makes sense for your domain. The query should return only rows you would consider anomalous, or breaking the rule.
  2. Set the rule to alert when row count is greater than 0.

🚧

Please consider the volume of data you will be processing in your rule

To prevent expensive queries for large tables, we recommend filtering records on a partition (e.g. rows created in the last day).

Referential integrity
All IDs in a child table must be present in a parent table.

WITH child 
     AS (SELECT {{column}} AS id 
         FROM   {{table_1}}), 
     parent 
     AS (SELECT {{column}} AS id 
         FROM   {{table_2}}) 
SELECT * 
FROM   child 
       LEFT JOIN parent using (id) 
WHERE  child.id IS NOT NULL 
       AND parent.id IS NULL

No duplicates
No duplicates in a given column.

SELECT {{column}}
FROM   {{table}}
GROUP  BY {{column}} 
HAVING Count(*) > 1

Label validation
All values in a column are part of a set of valid labels.

SELECT {{column}} 
FROM   {{table}} 
WHERE  {{column}} NOT IN ( '{{value_1}}', '{{value_2}}' )

Null values
There are no records with a NULL value in a given column.

SELECT {{column}} 
FROM   {{table}} 
WHERE  {{column}} IS NULL

Empty strings
There are no records with an empty string in a given column.

SELECT {{column}} 
FROM   {{table}} 
WHERE  TRIM({{column}}) = ''

Minimum and maximum
There are no values that exceed particular thresholds.

SELECT {{column}} 
FROM   {{table}} 
WHERE  NOT( {{column}} > {{value_1}} 
            AND {{column}} <= {{value_2}} )

Null-like strings
No records contain the string "Null" or "None".

SELECT {{column}} 
FROM   {{table}}
WHERE  Lower({{column}}) LIKE 'null' 
        OR Lower({{column}}) LIKE 'none'

Row count match
Two tables have the same number of rows.

WITH t1 
     AS (SELECT Count(*) 
         FROM   {{table_1}}), 
     t2 
     AS (SELECT Count(*) 
         FROM   {{table_2}}), 
     t3 
     AS (SELECT (SELECT * 
                 FROM   t1) - (SELECT * 
                               FROM   t2) AS diff_count) 
SELECT diff_count 
FROM   t3 
WHERE  diff_count != 0

Identical tables
Two tables are identical record for record.

(SELECT * 
 FROM   {{table_1}} 
 {{except_operator}} 
 SELECT * 
 FROM   {{table_2}}) 
UNION ALL 
(SELECT * 
 FROM   {{table_2}} 
 {{except_operator}}  
 SELECT * 
 FROM   {{table_1}})

📘

What should I use for the except operator?

This operator varies per warehouse.

On BigQuery you can use EXCEPT DISTINCT, while EXCEPT works for Redshift and Snowflake.

More than one value
A column has at least two distinct values.

SELECT Count(DISTINCT {{column}}) c 
FROM   {{table}}
HAVING c = 1

Record validation
A SQL expression holds true for all individual records in a table.

SELECT * 
FROM   {{table}} 
WHERE  NOT({{expression}})

📘

What is an example of an expression?

For instance, you can check that the sum of two columns is constant
{{expression}} = {{column_1}} + {{column_2}} = {{value}}

String patterns
All strings in column match a particular pattern, expressed as a regular expression. Can be used to validate common patterns like UUIDs, phone numbers, emails, numbers, escape characters, dates, etc.

Please note:

  • {{regex_function}} will be the appropriate regex matching function for your use case and your data warehouse. For example, to fully match the string with a pattern on Snowflake you would use REGEXP_LIKE, whereas to partially match the string with a pattern on BigQuery, you would use REGEXP_CONTAINS.
  • {{pattern}} can be any regular expression. For example, to check for US zip codes, you might use ^[0-9]{5}(?:-[0-9]{4})?$.
SELECT * 
FROM   {{table}} 
WHERE  NOT({{regex_function}}({{column}}, {{pattern}}))

Samplings

Custom SQL rule output sampling stores the first 500 rows returned by each run of a custom SQL rule to aid with incident investigation. These output samples will only be stored on the data collector and are not sent to Monte Carlo's cloud service. Output sampling will not occur if any of the following are true:

  • The data collector version is too old to support output sampling (requires version 2307 or newer)
  • The customer has opted out of sampling for the warehouse
  • The custom SQL rule has a relative threshold
  • The custom SQL rule has a threshold of 500 or more

Note for data lakes: output sampling for data lakes will add a LIMIT 500 to the query, so the resulting metric will be capped at 500 rows. This means that if the UI displays a value of 500, the unwrapped query could return more than 500 rows. This does not affect rule breach evaluation. This limitation does not apply to custom SQL rules for data warehouses, which report the full row count even with sampling enabled.


Did this page help you?