Marking Alerts as Incidents (beta)

Marking an alert as an incident, when appropriate, is a best practice for data teams looking to improve data quality and trust across the business. Incidents should be communicated to stakeholders when appropriate and reviewed on a monthly cadence to identify gaps in data quality. Furthermore, incident severity levels help teams understand impact quickly and set priorities.

Lifecycle of an Alert

Alert creation

Alerts are triggered with No status.

Acknowledge an alert

There is an option to acknowledge an alert. This sets the status to Acknowledged. An acknowledged alert is being investigated or worked on, but is not yet resolved. It also may not yet be marked as an incident. Typically, the person investigating the alert should be assigned as the owner at this point.

The alternative option is to directly mark the alert as an incident. In a single step, this acknowledges the alert and signifies that it has been confirmed as an issue or requires work to resolve. An alert can also be marked as an incident after it has been acknowledged.

Mark as incident

When an alert is confirmed as an issue or requires work to resolve, it can be marked as an incident. Marking an incident will prompt you to rename the alert, add a comment, and, most importantly, apply a severity. An alert may also have a pre-set Priority, which can help inform the severity of the issue. See more about determining severity below.

Resolve

If an alert is not an incident, it can be resolved with one of three statuses:

| Status | Intended purpose | Impacts ML Models? |
| --- | --- | --- |
| Expected | The detection was a valid anomaly from a statistical standpoint, but was the expected result of something like a pipeline change or planned maintenance. | No |
| No action needed | The detection was a valid anomaly from a statistical standpoint, but was not important enough to merit any further action. | No |
| False positive | The detection was not a valid statistical anomaly. Sometimes, this can be the result of a data collection issue from Monte Carlo. | Yes |

An alert that has been marked as an incident can be resolved as Fixed. If the incident was generated from anomaly detection, this will also remove the anomaly from the training data.

Note: If an alert from an anomaly is not an incident (e.g. an intentional deletion of data that triggers a volume anomaly alert), but you want to keep alerting on similar anomalies in the future, it is best practice to mark the alert as Fixed so that the anomaly is removed from the training data. This workflow is currently being evaluated for improvement. See more info at Alert Statuses.
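The lifecycle described above can be modeled as a small state machine. This is an illustrative sketch only; the status names follow the document, but the `AlertStatus` enum and `can_transition` helper are hypothetical, not part of any real API.

```python
from enum import Enum, auto

class AlertStatus(Enum):
    NO_STATUS = auto()         # alert has just been triggered
    ACKNOWLEDGED = auto()      # under investigation, owner assigned
    INCIDENT = auto()          # confirmed issue, severity applied
    EXPECTED = auto()          # valid anomaly, but anticipated
    NO_ACTION_NEEDED = auto()  # valid anomaly, not worth acting on
    FALSE_POSITIVE = auto()    # not a valid statistical anomaly
    FIXED = auto()             # incident resolved

# Allowed transitions, per the lifecycle above (illustrative assumption).
TRANSITIONS = {
    AlertStatus.NO_STATUS: {
        AlertStatus.ACKNOWLEDGED, AlertStatus.INCIDENT,
        AlertStatus.EXPECTED, AlertStatus.NO_ACTION_NEEDED,
        AlertStatus.FALSE_POSITIVE,
    },
    AlertStatus.ACKNOWLEDGED: {
        AlertStatus.INCIDENT, AlertStatus.EXPECTED,
        AlertStatus.NO_ACTION_NEEDED, AlertStatus.FALSE_POSITIVE,
    },
    AlertStatus.INCIDENT: {AlertStatus.FIXED},
}

def can_transition(current: AlertStatus, new: AlertStatus) -> bool:
    """Return True if the lifecycle allows moving from `current` to `new`."""
    return new in TRANSITIONS.get(current, set())
```

Note that only alerts marked as incidents reach Fixed in this sketch; non-incident alerts end in one of the three resolution statuses instead.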

Severity

Severity is an important tool in reporting data quality and building trust with customers/stakeholders. Severity is determined by evaluating the level of impact the incident has on any given stakeholder/customer and the number of stakeholders/customers affected by it. Severity can always be updated on an incident if the assessment changes.

Customers / stakeholders can be internal or external users of the data platform that rely on it to provide fit-for-purpose data.

Below are some guidelines on how severity could be evaluated. This is meant as a starting point that should be iterated on to best fit your business.

| Impact / Affected | Minor | Medium | Critical |
| --- | --- | --- | --- |
| 1-3 Customers / Stakeholders | SEV-4 | SEV-2 | SEV-2 |
| 4-10 Customers / Stakeholders | SEV-3 | SEV-2 | SEV-2 |
| 10+ Customers / Stakeholders | SEV-3 | SEV-1 | SEV-1 |
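The guideline matrix above can be encoded as a simple lookup. This is a minimal sketch of the example thresholds, assuming the impact levels and customer-count bands shown; adjust both to fit your own severity policy.

```python
from enum import Enum

class Impact(Enum):
    MINOR = "Minor"
    MEDIUM = "Medium"
    CRITICAL = "Critical"

def severity(impact: Impact, affected_customers: int) -> str:
    """Map impact level and number of affected customers/stakeholders
    to a severity label, following the guideline matrix above."""
    if impact is Impact.MINOR:
        # Minor impact never exceeds SEV-3, regardless of reach
        return "SEV-4" if affected_customers <= 3 else "SEV-3"
    # Medium and Critical share a column in the matrix:
    # SEV-1 only when more than 10 customers/stakeholders are affected
    return "SEV-1" if affected_customers > 10 else "SEV-2"
```

For example, a medium-impact incident affecting 5 stakeholders maps to SEV-2, while a critical incident affecting 15 maps to SEV-1.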

Here are some guidelines for how customer impact should be evaluated:

| Impact | Description | Examples |
| --- | --- | --- |
| Minor | Causes minor inconvenience, but does not affect the core value of the dataset | An unused field is missing values; numbers are off by 1% in internal reports where 10% accuracy is sufficient |
| Medium | Affects the core value of the dataset mildly or only in some scenarios | Monthly aggregates are missing data from the past day; 1-2 geographies have inaccurate sales numbers for internal reporting |
| Critical | Affects the core value of the dataset in a meaningful manner, most of the time | Reports used for real-time operations are missing data from the past 24 hours; >20% of user records are missing from a critical fact table |

Recommended Metrics

It's recommended to track the following metrics around incident response to hold your team accountable to data quality initiatives. Communicating these metrics to business stakeholders can help build trust around the data platform.

  • Incidents by Severity: How many incidents are we seeing over time? How severe are they? Are they correlated to specific data products or domains?
  • Time to Response (TTR): How quickly are we acknowledging alerts? Are we within SLA for acknowledging high severity incidents?
  • Time to Fixed (TTF): How long does it take to resolve incidents? By severity?
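The TTR and TTF metrics above can be computed from alert timestamps. The sketch below assumes hypothetical record fields (`triggered_at`, `acknowledged_at`, `resolved_at`); these names are illustrative, not an actual API.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; field names are assumptions for illustration.
alerts = [
    {"severity": "SEV-2",
     "triggered_at": datetime(2024, 1, 1, 9, 0),
     "acknowledged_at": datetime(2024, 1, 1, 9, 30),
     "resolved_at": datetime(2024, 1, 1, 12, 0)},
    {"severity": "SEV-1",
     "triggered_at": datetime(2024, 1, 2, 14, 0),
     "acknowledged_at": datetime(2024, 1, 2, 14, 10),
     "resolved_at": datetime(2024, 1, 3, 14, 0)},
]

def mean_delta(deltas: list) -> timedelta:
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# Time to Response: trigger -> acknowledgement
ttr = mean_delta([a["acknowledged_at"] - a["triggered_at"] for a in alerts])
# Time to Fixed: trigger -> resolution
ttf = mean_delta([a["resolved_at"] - a["triggered_at"] for a in alerts])
```

Grouping the same calculation by `severity` lets you check, for example, whether high-severity incidents are acknowledged within SLA.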