Marking Alerts as Incidents (beta)

Marking an alert as an incident, when appropriate, is a best practice for data teams looking to improve data quality and trust across the business. Incidents should be communicated to stakeholders when appropriate and reviewed on a monthly cadence to identify gaps in data quality. Furthermore, incident severity levels help teams understand impact quickly and set priorities.

Lifecycle of an Alert

Alert creation

Alerts are triggered with No status.

Acknowledge an alert

There is an option to acknowledge an alert. This sets the status to Acknowledged. An acknowledged alert is being investigated or worked on, but is not yet resolved. It also may not yet be marked as an incident. Typically, the person investigating the alert should be assigned as the owner at this point.

The alternative option is to directly mark the alert as an incident. In a single step, this acknowledges the alert and signifies that it has been confirmed as an issue or requires work to resolve. An alert can also be marked as an incident after it has been acknowledged.

Mark as incident

When an alert is confirmed as an issue or requires work to resolve, it can be marked as an incident. Marking an incident will prompt you to rename the alert, add a comment, and, most importantly, apply a severity. An alert may also have a pre-set Priority, which can help inform the severity of the issue. See more about determining severity below.

Resolve

If an alert is not an incident, it can be resolved with one of three statuses:

| Status | Intended purpose | Impacts ML Models? |
| --- | --- | --- |
| Expected | The detection was a valid anomaly from a statistical standpoint, but was the expected result of something like a pipeline change or planned maintenance. | No |
| No action needed | The detection was a valid anomaly from a statistical standpoint, but was not important enough to merit any further action. | No |
| False positive | The detection was not a valid statistical anomaly. Sometimes, this can be the result of a data collection issue from Monte Carlo. | Yes |

An alert that has been marked as an incident can be resolved as Fixed. If the incident was generated from anomaly detection, this will also remove the anomaly from the training data.

Note: If an alert from an anomaly is not an incident (e.g. an intentional deletion of data that triggers a volume anomaly alert), but you want to keep alerting on similar anomalies in the future, it is best practice to mark the alert as Fixed so that the anomaly is removed from the training data. This workflow is currently being evaluated for improvement. See more info at Alert Statuses.
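The lifecycle described above can be modeled as a small state machine. This is an illustrative sketch only; the status names follow the document, but the `AlertStatus` enum and `can_transition` helper are hypothetical, not part of any real API.

```python
from enum import Enum, auto

class AlertStatus(Enum):
    NO_STATUS = auto()         # alert has just been triggered
    ACKNOWLEDGED = auto()      # under investigation, owner assigned
    INCIDENT = auto()          # confirmed issue, severity applied
    EXPECTED = auto()          # valid anomaly, but anticipated
    NO_ACTION_NEEDED = auto()  # valid anomaly, not worth acting on
    FALSE_POSITIVE = auto()    # not a valid statistical anomaly
    FIXED = auto()             # incident resolved

# Allowed transitions, per the lifecycle above (illustrative assumption).
TRANSITIONS = {
    AlertStatus.NO_STATUS: {
        AlertStatus.ACKNOWLEDGED, AlertStatus.INCIDENT,
        AlertStatus.EXPECTED, AlertStatus.NO_ACTION_NEEDED,
        AlertStatus.FALSE_POSITIVE,
    },
    AlertStatus.ACKNOWLEDGED: {
        AlertStatus.INCIDENT, AlertStatus.EXPECTED,
        AlertStatus.NO_ACTION_NEEDED, AlertStatus.FALSE_POSITIVE,
    },
    AlertStatus.INCIDENT: {AlertStatus.FIXED},
}

def can_transition(current: AlertStatus, new: AlertStatus) -> bool:
    """Return True if the lifecycle allows moving from `current` to `new`."""
    return new in TRANSITIONS.get(current, set())
```

Note that only alerts marked as incidents reach Fixed in this sketch; non-incident alerts end in one of the three resolution statuses instead.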

Severity

Severity is an important tool in reporting data quality and building trust with customers/stakeholders. Severity is determined by evaluating the level of impact the incident has on any given stakeholder/customer and the number of stakeholders/customers affected by it. Severity can always be updated on an incident if the assessment changes.

Customers / stakeholders can be internal or external users of the data platform that rely on it to provide fit-for-purpose data.

Below are some guidelines on how severity could be evaluated. This is meant as a starting point that should be iterated on to best fit your business.

| Impact / Affected | Minor | Medium | Critical |
| --- | --- | --- | --- |
| 1-3 Customers / Stakeholders | SEV-4 | SEV-2 | SEV-2 |
| 4-10 Customers / Stakeholders | SEV-3 | SEV-2 | SEV-2 |
| 10+ Customers / Stakeholders | SEV-3 | SEV-1 | SEV-1 |
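The guideline matrix above can be encoded as a simple lookup. This is a minimal sketch of the example thresholds, assuming the impact levels and customer-count bands shown; adjust both to fit your own severity policy.

```python
from enum import Enum

class Impact(Enum):
    MINOR = "Minor"
    MEDIUM = "Medium"
    CRITICAL = "Critical"

def severity(impact: Impact, affected_customers: int) -> str:
    """Map impact level and number of affected customers/stakeholders
    to a severity label, following the guideline matrix above."""
    if impact is Impact.MINOR:
        # Minor impact never exceeds SEV-3, regardless of reach
        return "SEV-4" if affected_customers <= 3 else "SEV-3"
    # Medium and Critical share a column in the matrix:
    # SEV-1 only when more than 10 customers/stakeholders are affected
    return "SEV-1" if affected_customers > 10 else "SEV-2"
```

For example, a medium-impact incident affecting 5 stakeholders maps to SEV-2, while a critical incident affecting 15 maps to SEV-1.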

Here are some guidelines for how customer impact should be evaluated:

| Impact | Description | Examples |
| --- | --- | --- |
| Minor | Causes minor inconvenience, but does not affect the core value of the dataset | An unused field is missing values; numbers are off by 1% in internal reports where 10% accuracy is sufficient |
| Medium | Affects the core value of the dataset mildly or only in some scenarios | Monthly aggregates are missing data from the past day; 1-2 geographies have inaccurate sales numbers for internal reporting |
| Critical | Affects the core value of the dataset in a meaningful manner, most of the time | Reports used for real-time operations are missing data from the past 24 hours; >20% of user records are missing from a critical fact table |

Recommended Metrics

It's recommended to track the following metrics around incident response to hold your team accountable to data quality initiatives. Communicating these metrics to business stakeholders can help build trust around the data platform.

  • Incidents by Severity: How many incidents are we seeing over time? How severe are they? Are they correlated to specific data products or domains?
  • Time to Response (TTR): How quickly are we acknowledging alerts? Are we within SLA for acknowledging high severity incidents?
  • Time to Fixed (TTF): How long does it take to resolve incidents? By severity?
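The TTR and TTF metrics above can be computed from alert timestamps. The sketch below assumes hypothetical record fields (`triggered_at`, `acknowledged_at`, `resolved_at`); these names are illustrative, not an actual API.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; field names are assumptions for illustration.
alerts = [
    {"severity": "SEV-2",
     "triggered_at": datetime(2024, 1, 1, 9, 0),
     "acknowledged_at": datetime(2024, 1, 1, 9, 30),
     "resolved_at": datetime(2024, 1, 1, 12, 0)},
    {"severity": "SEV-1",
     "triggered_at": datetime(2024, 1, 2, 14, 0),
     "acknowledged_at": datetime(2024, 1, 2, 14, 10),
     "resolved_at": datetime(2024, 1, 3, 14, 0)},
]

def mean_delta(deltas: list) -> timedelta:
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# Time to Response: trigger -> acknowledgement
ttr = mean_delta([a["acknowledged_at"] - a["triggered_at"] for a in alerts])
# Time to Fixed: trigger -> resolution
ttf = mean_delta([a["resolved_at"] - a["triggered_at"] for a in alerts])
```

Grouping the same calculation by `severity` lets you check, for example, whether high-severity incidents are acknowledged within SLA.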