Using Monte Carlo's Incident IQ

This tutorial will teach you how to use Monte Carlo's Incident IQ to triage anomalies detected by Monte Carlo, analyze their root cause, and assess their downstream impact.

🚧

Last Updated: April 14, 2022

Transcript

Hello, and welcome to another edition of Monte Carlo University! Today, I'll be walking through the Incident IQ page. This is where you'll go to learn more about an incident as it occurs, understand its potential downstream impact, and take the first step toward root cause analysis.

Incident IQ

To begin, on the left-hand side we can see a high-level incident summary. Here you can view and update the status of the incident, and see the number of impacted events and the number of key assets. You can also view or update the ownership information and the severity of this specific incident:

Incident Summary

In the middle, you can see the impact radius. It shows the number of queries potentially impacted by this incident, as well as the number of end users querying this table who could be impacted. Within this dropdown, you can see those users along with the number of queries they've recently made on this table, or on a number of tables. If you're connected to a downstream BI tool, you'll also see the number of BI users and reports potentially impacted downstream:

Impact Radius

Warehouses Users and Queries Impacted by Incident

Within the root cause analysis section, you can quickly jump into sampling queries and learn more from the query logs. If I click on the query logs, we can see the updates made to this table over time, when the freshness incident in this example occurred, and how the queries changed. We can also quickly compare the two queries based on their length. On top of that, you can see the queries made on this table as well. In this case, we have Databricks as well as dbt users querying this table:

Updates made over time

Queries on the table

As I jump down a little further, we can get into the incident itself. In this case, we can see that this table was typically updated about once an hour. Then the cadence of updates changed to around every six hours. Because of this change and what we had seen historically, we surfaced this as a freshness incident the first time the change occurred. As you can also see, we did not flag it going forward, because our machine learning began to learn that this is the new expectation.
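To make the freshness example more concrete, here is a minimal Python sketch of the general idea: compare the time since the last update against the typical gap between recent updates. This is only an illustration, not Monte Carlo's actual model. The function name `is_freshness_anomaly`, the `window` and `tolerance` parameters, and the median-based rule are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from statistics import median


def is_freshness_anomaly(update_times, now, window=20, tolerance=3.0):
    """Hypothetical rule: flag a table when the time since its last update
    exceeds `tolerance` times the median gap of its most recent updates.

    update_times: chronologically sorted datetimes of past table updates.
    now: the datetime at which the check runs.
    """
    if len(update_times) < 3:
        return False  # not enough history to estimate a cadence

    # Gaps (in seconds) between consecutive updates, limited to a recent window
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(update_times, update_times[1:])
    ]
    recent_gaps = gaps[-window:]

    typical_gap = median(recent_gaps)
    current_gap = (now - update_times[-1]).total_seconds()
    return current_gap > tolerance * typical_gap


# Illustrative history: a table that has been updated once an hour
start = datetime(2022, 4, 1)
hourly_updates = [start + timedelta(hours=i) for i in range(25)]
six_hours_later = hourly_updates[-1] + timedelta(hours=6)
flagged = is_freshness_anomaly(hourly_updates, six_hours_later)
```

With an hourly history, a six-hour gap is flagged. Once the recent window is dominated by six-hour gaps, the same gap is no longer unusual, which loosely mirrors the "new expectation" behavior described above.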

On the left-hand side, I can also see the incident timeline, which shows the time this incident occurred and which table it occurred on. I can then see any subsequent updates, such as status changes, and if I were to comment on this incident, that comment would also be added to the timeline.

Freshness History

Incident Timeline

As I scroll down below the incident, on the right-hand side we can see the lineage for this impacted table, both directly upstream and downstream. So I can see which tables are feeding data into this table, as well as where this data goes downstream and what the potential impact is there. If there are any downstream BI reports that could be impacted by this data delay, those would show here under the reports affected section. I can also see whether there have been any historical incidents on this table, when they occurred, and what type of incidents they were. Lastly, if I have custom monitors set up on this table, I would be able to see those listed here under the active monitors section.

Lineage of Impacted Table

As a next step, you can "Sample" the data within this table to understand how recent the newest data on the table is, based on the different timestamps. Here I can see the most recent data for each of these timestamps and how far in the past they are. Lastly, I can jump into pipelines if I want a holistic picture of the lineage for this table downstream, or jump into the catalog section if I want to know a little more about the table itself.

Live Freshness

I hope this was helpful. Please feel free to reach out to [email protected] or the chatbot in the lower right-hand corner if you have any questions!
