Query Performance Monitor

Monitor query runtime performance and alert on regressions.

Overview

Catch query performance regressions before they blow your SLAs or warehouse budget. The query performance monitor tracks SQL execution times and alerts on runtime changes relative to a baseline, individual slow queries, or consecutive threshold breaches.

πŸ“˜

Reference scope

This page covers MaC YAML configuration. For how query performance monitoring works, see Query Performance.

Query performance monitors have no schedule field β€” they run on Monte Carlo's internal collection cycle. MaC key: query_performance. This monitor does not target a specific table β€” it uses a selection array to filter which queries are monitored by database, schema, or table.

πŸ“˜

A query performance monitor requires exactly three alert_conditions β€” one each for runtime, runtime_change, and consecutive_breach_count. You tune (or effectively disable, by setting a high threshold) each metric, but all three must be present or apply fails with There must be exactly 3 comparisons.

Quick Start

montecarlo:
  query_performance:
    - name: core_model_runtime
      description: Alert when core dbt model queries slow down significantly
      warehouse: prod-snowflake
      selection:
        - field_name: database
          values:
            - ANALYTICS
        - field_name: schema
          values:
            - CORE
      alert_conditions:
        - metric: runtime
          threshold: 300000
        - metric: runtime_change
          threshold: 2.0
          baseline_agg_function: MEDIAN
          threshold_lookback_minutes: 10080
        - metric: consecutive_breach_count
          threshold: 3
      domains:
        - my-domain

Configuration

alert_conditions β€” performance alert rules

array of objects Β· required

Each entry defines a performance metric and the threshold that triggers an alert. You must supply exactly three entries β€” one each for runtime, runtime_change, and consecutive_breach_count. This schema uses additionalProperties: false β€” only the fields listed below are accepted.

alert_conditions:
  - metric: runtime
    threshold: 300000
  - metric: runtime_change
    threshold: 2.0
    baseline_agg_function: MEDIAN
    threshold_lookback_minutes: 10080
  - metric: consecutive_breach_count
    threshold: 3

Properties


metric β€” performance metric to evaluate

enum Β· required

Accepted values: runtime Β· runtime_change Β· consecutive_breach_count (lowercase)

Each of the three values must appear exactly once across the alert_conditions array. See Available Metrics for threshold meanings and required companion fields.

metric: runtime_change

threshold β€” threshold value

number Β· required

Meaning depends on the metric: multiplier for runtime_change, milliseconds for runtime, count for consecutive_breach_count. Must be greater than 0.

threshold: 2.0

baseline_agg_function β€” baseline aggregation function

enum Β· required, for runtime_change

Accepted values: MEDIAN Β· AVG (uppercase)

Aggregation function for the historical baseline.

baseline_agg_function: MEDIAN

threshold_lookback_minutes β€” lookback window in minutes

integer Β· one of threshold_lookback_minutes or threshold_lookback_data_points is required for runtime_change

Historical lookback window in minutes for computing the baseline. Mutually exclusive with threshold_lookback_data_points β€” set exactly one.

threshold_lookback_minutes: 10080

threshold_lookback_data_points β€” lookback window in data points

integer Β· optional

Number of historical data points for the baseline. Mutually exclusive with threshold_lookback_minutes.

threshold_lookback_data_points: 100

comparison_type β€” comparison type

string Β· optional

Use QUERY_PERFORMANCE if set.

comparison_type: QUERY_PERFORMANCE

operator β€” comparison operator

enum Β· optional

Accepted values: GT

Only GT is supported.

operator: GT
selection β€” query scope filters

array of objects Β· optional

Filters which queries are included in performance monitoring. Each entry narrows the scope by a specific dimension. Multiple entries are combined with AND logic.

selection:
  - field_name: database
    values:
      - ANALYTICS
  - field_name: schema
    values:
      - CORE

Properties


field_name β€” dimension to filter on

string Β· required

Common values: database, schema, table.

field_name: database

values β€” allowed values for the dimension

array of strings Β· optional

Queries matching any value are included.

values:
  - ANALYTICS
  - PRODUCTION

operator β€” filter operator

enum Β· optional Β· default: include

Accepted values: include Β· exclude (lowercase)

Whether matching queries are included in or excluded from the monitored scope.

operator: include
domains β€” domain for this monitor

array of strings (exactly one entry) Β· required, on all accounts created after January 2025

Set default_domain in montecarlo.yml to avoid repeating it on every monitor.

domains:
  - my-domain
warehouse β€” which warehouse to use

string Β· optional (yes if multiple warehouses)

Warehouse UUID or name. Overrides default_resource from montecarlo.yml.

warehouse: prod-snowflake
name β€” unique identifier within the namespace

string Β· required

Required for monitors created after Jan 29, 2024 (existing monitors keep working). Changing the name creates a new monitor and deletes the old one β€” incident history does not transfer.

name: core_model_runtime_regression
description β€” monitor description

string Β· required

Max 512 characters.

description: Detect runtime regressions in core analytics models
notes β€” internal notes

string Β· optional

Visible in the Monte Carlo UI. Not included in notifications.

notes: Owned by the analytics team. Reviewed quarterly.
audiences β€” notification channels

array of strings Β· optional

Audience names linking this monitor to channels defined in Notifications as Code.

audiences:
  - data-engineering-alerts
  - etl-ops
failure_audiences β€” run-failure notification channels

array of strings Β· optional

Separate audiences for run-failure notifications. Falls back to audiences if not set.

failure_audiences:
  - oncall-alerts
notify_run_failure β€” notify on execution errors

boolean Β· optional

notify_run_failure: true
priority β€” incident priority level

enum Β· optional

Accepted values: P1 Β· P2 Β· P3 Β· P4 Β· P5

priority: P2
tags β€” key-value pairs for organizing monitors

array of objects Β· optional

PropertyTypeRequiredDescription
namestringyesTag key
valuestringnoTag value

tags:
  - name: team
    value: analytics
  - name: environment
    value: production
data_quality_dimension β€” data quality category

enum Β· optional

Accepted values: ACCURACY Β· COMPLETENESS Β· CONSISTENCY Β· TIMELINESS Β· UNIQUENESS Β· VALIDITY

data_quality_dimension: TIMELINESS
is_draft β€” create as draft without activating

boolean Β· optional Β· default: false

Creates the monitor in a paused state. Omitting this on a later update resets to false (active) due to PUT semantics β€” always include it if you want the monitor to stay in draft.

is_draft: true
uuid β€” update an existing monitor

string Β· optional

Include the UUID of an existing monitor to update it instead of creating a new one.

uuid: 0dae7702-0950-45c7-909c-7e183bddca19
Deprecated fields
FieldUse instead
resourcewarehouse
domaindomains
domain_uuidsdomains
labelsaudiences
notify_rule_run_failurenotify_run_failure

Available Metrics

All three metrics must be present in every monitor (see the note above). Tune each threshold to the scope you care about.

MetricDescriptionRequired FieldsThreshold Meaning
runtime_changeDetects queries whose runtime has increased relative to a historical baseline.baseline_agg_function, plus either threshold_lookback_minutes or threshold_lookback_data_pointsMultiplier. A threshold of 2.0 means "alert when runtime exceeds 2x the baseline."
runtimeAlerts when any query execution exceeds a fixed duration.None beyond thresholdMilliseconds. A threshold of 300000 means "alert on queries running longer than 5 minutes."
consecutive_breach_countAlerts when a query breaches its threshold for N consecutive runs.None beyond thresholdCount. A threshold of 3 means "alert after 3 consecutive breaches."

Examples

Runtime change detection with median baseline

Tuned to flag queries in the CORE schema that take more than 2x their median runtime over the past week. All three metrics are configured (required); the runtime and consecutive_breach_count thresholds are set permissively so runtime_change drives the alerting.

montecarlo:
  query_performance:
    - name: core_model_runtime_regression
      description: Detect runtime regressions in core analytics models
      warehouse: prod-snowflake
      selection:
        - field_name: database
          values:
            - ANALYTICS
        - field_name: schema
          values:
            - CORE
      alert_conditions:
        - metric: runtime_change
          threshold: 2.0
          baseline_agg_function: MEDIAN
          threshold_lookback_minutes: 10080
        - metric: runtime
          threshold: 600000
        - metric: consecutive_breach_count
          threshold: 5
      audiences:
        - data-engineering-alerts
      priority: P2
      data_quality_dimension: TIMELINESS
      tags:
        - name: team
          value: analytics
      domains:
        - my-domain

Absolute runtime threshold

Tuned to flag any query in the billing schema that takes longer than 10 minutes (600000 ms). The runtime_change and consecutive_breach_count conditions are still required, so they are included with permissive thresholds.

montecarlo:
  query_performance:
    - name: billing_slow_queries
      description: Alert on billing queries exceeding 10 minutes
      warehouse: prod-snowflake
      selection:
        - field_name: database
          values:
            - PRODUCTION
        - field_name: schema
          values:
            - BILLING
      alert_conditions:
        - metric: runtime
          threshold: 600000
        - metric: runtime_change
          threshold: 3.0
          baseline_agg_function: MEDIAN
          threshold_lookback_minutes: 10080
        - metric: consecutive_breach_count
          threshold: 5
      priority: P3
      domains:
        - my-domain

Consecutive breach detection

Tuned to alert after a query breaches its runtime threshold 3 times in a row, reducing noise from one-off spikes. As always, all three metrics are present.

montecarlo:
  query_performance:
    - name: etl_consecutive_slowdowns
      description: Alert after 3 consecutive slow runs in ETL pipeline queries
      warehouse: prod-snowflake
      selection:
        - field_name: database
          values:
            - RAW
      alert_conditions:
        - metric: consecutive_breach_count
          threshold: 3
        - metric: runtime
          threshold: 600000
        - metric: runtime_change
          threshold: 2.0
          baseline_agg_function: MEDIAN
          threshold_lookback_minutes: 10080
      audiences:
        - etl-ops
      priority: P2
      domains:
        - my-domain

Troubleshooting

Scheduling and scope

  • Adding a schedule field. Query performance monitors run on Monte Carlo's internal collection cycle. There is no schedule field. Adding one causes a validation error.
  • Using table instead of selection. Query performance monitors do not have a table field. Use selection with field_name: table to target specific tables.

Alert conditions and metrics

  • Supplying fewer than three alert_conditions. A query performance monitor must define exactly three conditions β€” one each for runtime, runtime_change, and consecutive_breach_count. Any other count fails with There must be exactly 3 comparisons. Tune thresholds permissively for metrics you don't want to drive alerts.
  • Using uppercase metric names or lowercase baseline_agg_function. Metric values are lowercase (runtime, runtime_change, consecutive_breach_count); baseline_agg_function values are uppercase (MEDIAN, AVG). Getting the case wrong fails validation.
  • Omitting baseline_agg_function for runtime_change. This metric requires a baseline aggregation function (MEDIAN or AVG). Omitting it causes a validation error. It also requires one of threshold_lookback_minutes or threshold_lookback_data_points.
  • Setting both threshold_lookback_minutes and threshold_lookback_data_points. These are mutually exclusive. Use one or the other to define the baseline window.
  • Passing extra fields in alert_conditions. The alert conditions schema for query performance monitors has additionalProperties: false. Including a field that belongs to other monitor types (like fields) causes a validation error. Note that operator is a recognized field but only accepts GT here β€” supplying operator: AUTO fails with an enum error, not an additionalProperties one.
  • Setting threshold to 0 or negative. All query performance metrics require a threshold greater than 0.

Other fields and update semantics

  • Wrong tags format. Tags must be objects with name and optional value keys. Writing tags: ["my-tag"] fails validation.
  • Forgetting PUT semantics on updates. When updating a monitor by including uuid, every field you omit reverts to its default β€” it is not left unchanged. Always specify the complete desired configuration.