Data Sampling
Some features in Monte Carlo can temporarily surface rows from your warehouse within the UI, or let users query the warehouse in a way that can return sensitive business metrics.
Collectively, we refer to this set of features as data sampling. For security, compliance, or regulatory reasons, a small subset of customers choose to disable data sampling for their Monte Carlo environment. Data sampling is a warehouse-level control that a technical member of your Monte Carlo account team can turn on or off.
The data from these features sits in Object Storage. See our Architecture & Deployment Options for details on where Object Storage fits within the broader Monte Carlo architecture.
Features that are unavailable when Data Sampling is disabled
Within Monitoring
- Within SQL Rules:
- "Value-based" SQL rules: unlike count-based SQL rules, which wrap the results of the query in a count (helping to obscure any information within), a value-based SQL rule allows the user to return a numeric value. For example, the user may want to check that "sum of sales from yesterday's orders is > $1,000,000".
- Parameterized values in SQL rules: within count-based SQL Rules, a user can pass values from breached rows directly into notifications by using the syntax {{query_result:field_name}} within the monitor's notes. This helps accelerate time-to-resolution. For example, for a SQL Rule that checks a certain quality in a table of Salesforce opportunities, it is helpful to include in the notification the list of opportunity_id values that had faulty data.
- Test your SQL query: users can test their SQL query to confirm that it will complete successfully. These tests will show the count of rows or value returned by the query.
- Within Validation Monitors:
- Previewing results of "sets": when using the "is in set" or "is not in set" operators, users can define a set by referring to another field or by writing a query. When testing the set, the user will see a preview of several values from the set.
- Within Monitoring Agent:
- Advanced recommendations: the Monitoring Agent generates intelligent recommendations for custom data quality monitors. When data sampling is disabled, advanced recommendations are not made, but the Monitoring Agent can continue to surface heuristic (non-AI) recommendations, which do not use data samples.
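To make the count-based versus value-based distinction above concrete, here is a minimal Python sketch using an in-memory SQLite table. It is an illustration only, not Monte Carlo's implementation: the `orders` table, the helper names (`count_based`, `value_based`, `render_notes`), and the choice to join multiple breached values with commas are all assumptions made for this example. Only the {{query_result:field_name}} placeholder syntax comes from the text above.

```python
import re
import sqlite3

# Hypothetical in-memory table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (opportunity_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("opp-1", 400000.0), ("opp-2", -50.0), ("opp-3", 250000.0)],
)

# A user-written rule query that selects the faulty rows.
user_query = "SELECT opportunity_id, amount FROM orders WHERE amount < 0"


def count_based(connection, query):
    """Wrap the query in COUNT(*) so only a row count is returned."""
    return connection.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()[0]


def value_based(connection, query):
    """Return the single numeric value the query itself produces."""
    return connection.execute(query).fetchone()[0]


def render_notes(template, rows):
    """Fill {{query_result:field_name}} placeholders from breached rows.

    Joining values with ", " is an assumption made for this sketch.
    """
    def substitute(match):
        field = match.group(1)
        return ", ".join(str(row[field]) for row in rows)

    return re.sub(r"\{\{query_result:(\w+)\}\}", substitute, template)


breach_count = count_based(conn, user_query)                     # only a count
total_sales = value_based(conn, "SELECT SUM(amount) FROM orders")  # a business metric

# Fetch the breached rows by column name to fill in the notes template.
conn.row_factory = sqlite3.Row
breached_rows = conn.execute(user_query).fetchall()
notes = render_notes(
    "Faulty opportunities: {{query_result:opportunity_id}}", breached_rows
)
```

The count-based call exposes only how many rows matched, while the value-based call and the rendered notes both carry values taken from the underlying data, which is why the latter two depend on data sampling.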
Within Resolution
- Root cause analyses: after certain types of anomalies are detected, Monte Carlo will run follow-up queries that help to identify traits about the erroneous data. These can be used as helpful clues to more quickly identify the root cause of the data issue.
- Troubleshooting Agent: when an alert is received from Monte Carlo, the Troubleshooting Agent can automatically evaluate a wide range of potential root cause hypotheses by leveraging metadata, lineage, query logs, and metrics collected from various integrations for monitoring. The agent’s functionality is further enhanced when data sampling is enabled. Even then, the agent does not access sampled data directly. Instead, it uses aggregated statistics derived from sampled data, which provide additional context for its analysis.
- "Breached rows" from SQL rule breaches: after the breach of "count-based" SQL rules, the user can view the specific breached rows within the Monte Carlo UI to better understand the faulty data.
- Metric investigation: users can investigate rows related to a metric anomaly by viewing a sample of underlying data and segmenting those rows by specific fields. This helps surface patterns, correlations, or values that contributed to the anomalous metric, and provides additional context for quicker diagnosis.
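The segmentation described under metric investigation can be pictured with a small Python sketch. This is a hypothetical illustration of the general idea (group a sample of rows by a field and compare segments), not the product's actual logic; the sample rows, field names, and helper functions are invented for the example.

```python
from collections import Counter

# Hypothetical sample of rows related to a metric anomaly.
sample_rows = [
    {"region": "EU", "status": "failed"},
    {"region": "EU", "status": "failed"},
    {"region": "US", "status": "ok"},
    {"region": "EU", "status": "ok"},
    {"region": "US", "status": "failed"},
]


def segment_counts(rows, field):
    """Count sampled rows per distinct value of `field`, most frequent first."""
    return Counter(row[field] for row in rows).most_common()


def segment_share(rows, field, predicate):
    """For each segment, compute the fraction of its rows matching `predicate`."""
    totals = Counter(row[field] for row in rows)
    hits = Counter(row[field] for row in rows if predicate(row))
    return {segment: hits[segment] / total for segment, total in totals.items()}


by_region = segment_counts(sample_rows, "region")
failure_share = segment_share(
    sample_rows, "region", lambda row: row["status"] == "failed"
)
```

In this toy sample, a higher failure share in one region would be the kind of pattern that points an investigator toward a likely contributing segment.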
Within Assets
- Common values in Data Profiler: when exploring the statistical profile of data within a table, users can view a distribution of the most frequent values for a selected field. This preview surfaces up to the 50 most common values based on the applied filters.
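As a rough sketch of what a "most common values" preview computes, the Python below counts value frequencies for one field after applying a filter and caps the result at 50 entries, matching the limit stated above. The function name, the row format, and the filter callback are assumptions for illustration; the actual profiler presumably computes this in the warehouse rather than in Python.

```python
from collections import Counter

MAX_COMMON_VALUES = 50  # cap stated in the text above


def common_values(rows, field, row_filter=None, limit=MAX_COMMON_VALUES):
    """Return up to `limit` (value, count) pairs for `field`, most frequent first."""
    selected = rows if row_filter is None else [r for r in rows if row_filter(r)]
    return Counter(r[field] for r in selected).most_common(limit)


# Hypothetical rows standing in for a profiled table.
rows = [
    {"country": "DE", "active": True},
    {"country": "DE", "active": True},
    {"country": "FR", "active": True},
    {"country": "FR", "active": False},
    {"country": "US", "active": False},
]

top_active = common_values(rows, "country", row_filter=lambda r: r["active"])

# With more than 50 distinct values, only 50 are kept.
many_rows = [{"country": f"C{i}", "active": True} for i in range(60)]
capped = common_values(many_rows, "country")
```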
