5 Essential NOC Metrics to Reach High Uptime and Detect Potential Outages

My latest tenure of 2.5 years was closely tied to designing and adopting an Incident Management Framework (as part of the Program Management org). This work was driven by two primary objectives:

  • Reach and maintain a system uptime of 99.99% (across our APIs and SDKs).
  • Ensure engineering is always the firsthand source of information for any potential outage that could result in downtime.

In our foundational days, we lacked a comprehensive alerting and monitoring system. Establishing the Network Operations Center (NOC) team was our strategic move to shape a robust system and take charge of Incident Management. We not only reached the 99.98% uptime benchmark but also raised our proactivity from spotting 60% of incidents ahead of our merchants to 95% and higher.

This post covers the core set of metrics for a Network Operations Center team, how they relate to the Incident Management process, and the common anti-patterns and measures for improvement that come with each.

Metrics that Steered Our Success and Measures to Improve Them

1. Time to First Response

  • Context: Rapid response times can make or break product reliability.
  • Industry Standard: 10-15 minutes.
  • Our Vector: An ambitious SLA of 1 minute.
  • Antipatterns: Over-optimizing can stretch the NOC team thin. Sometimes it’s wiser to slightly breach the SLA and invest in better-prepared future responses.
  • Impact: A delayed response can seriously impair the product’s dependability.
  • Measures for Improvement: Regularly refine our alert sources; the optimal range is 3-5 sources. This involves identifying system bottlenecks, monitoring typical patterns around them, and continuously tuning our alerting mechanism to reduce false positives and consolidate dashboards. A minimal tracking sketch follows this list.
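
As a rough illustration, here is a minimal sketch (Python) of how time to first response could be tracked against a 1-minute SLA. The field names such as fired_at and first_response_at are hypothetical; real data would come from your alerting or paging tool.

```python
from datetime import datetime, timedelta

# Hypothetical alert records; in practice these would come from your
# alerting/paging tool's API or an export.
alerts = [
    {"id": "A-101", "fired_at": "2024-05-01T10:00:00", "first_response_at": "2024-05-01T10:00:45"},
    {"id": "A-102", "fired_at": "2024-05-01T11:30:00", "first_response_at": "2024-05-01T11:32:10"},
]

SLA = timedelta(minutes=1)  # our (ambitious) target for first response

def time_to_first_response(alert):
    fired = datetime.fromisoformat(alert["fired_at"])
    responded = datetime.fromisoformat(alert["first_response_at"])
    return responded - fired

breaches = [a for a in alerts if time_to_first_response(a) > SLA]
compliance = 1 - len(breaches) / len(alerts)

print(f"SLA compliance: {compliance:.0%}")
for a in breaches:
    print(f"Breached SLA: {a['id']} took {time_to_first_response(a)}")
```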

2. Time to Acknowledge

  • Context: The initial acknowledgment sets the path for problem assessment and repair strategies.
  • Industry Standard: 10-15 minutes.
  • Our Vector: A 3-minute SLA.
  • Impact: The acknowledgment speed directly correlates with user trust.
  • Measures for Improvement: Similar to the first metric, we focused on refining our alert sources, ensuring the NOC team isn’t overwhelmed with too many data points. A consolidation sketch follows this list.
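
One way to keep the number of data points manageable is to group repeated alerts for the same service and symptom within a short window. A minimal sketch, with hypothetical alert fields:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw alerts; real ones would come from your monitoring stack.
raw_alerts = [
    {"service": "payments-api", "symptom": "5xx_rate", "at": "2024-05-01T10:00:05"},
    {"service": "payments-api", "symptom": "5xx_rate", "at": "2024-05-01T10:00:40"},
    {"service": "sdk-gateway", "symptom": "latency_p99", "at": "2024-05-01T10:02:00"},
]

WINDOW = timedelta(minutes=5)  # fold repeats of the same symptom into one group

def consolidate(alerts):
    grouped = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["at"]):
        key = (alert["service"], alert["symptom"])
        ts = datetime.fromisoformat(alert["at"])
        # Extend the last group if it is still within the window, else start a new one.
        if grouped[key] and ts - grouped[key][-1]["last_seen"] <= WINDOW:
            grouped[key][-1]["count"] += 1
            grouped[key][-1]["last_seen"] = ts
        else:
            grouped[key].append({"first_seen": ts, "last_seen": ts, "count": 1})
    return grouped

for (service, symptom), groups in consolidate(raw_alerts).items():
    for g in groups:
        print(f"{service}/{symptom}: {g['count']} alert(s) starting {g['first_seen']}")
```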

3. Time to Assemble

  • Context: Quick and appropriate team assembly means faster problem-solving.
  • Industry Standard: 30-45 minutes.
  • Our Vector: A 15-minute SLA.
  • Antipatterns: Summoning any team, rather than the right one, can be detrimental.
  • Impact: Swift and relevant team assembly leads to efficient problem resolution.
  • Measures for Improvement: Establish clear escalation paths and alert tags. Automation, using tools like PagerDuty with Jira, is essential once alerts have clear ownership and false positives are minimized. Run regular training and drills so the team is always prepared, and involve the team in decision-making to get a fresh perspective on the framework. A sketch of tag-based routing follows this list.
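
As a minimal illustration, here is a sketch of tag-based routing to owning teams. The tag and team names are hypothetical; in a real setup each entry would map to a PagerDuty escalation policy or a Jira component.

```python
# Hypothetical mapping from alert tags to owning on-call teams; in practice
# each entry would point to a PagerDuty escalation policy or Jira component.
ESCALATION_PATHS = {
    "payments": "payments-oncall",
    "sdk": "mobile-platform-oncall",
    "infra": "sre-oncall",
}
DEFAULT_TEAM = "noc-triage"  # fallback when ownership is unclear

def assemble_team(alert_tags):
    """Return the set of teams to page for a given set of alert tags."""
    teams = {ESCALATION_PATHS[tag] for tag in alert_tags if tag in ESCALATION_PATHS}
    return teams or {DEFAULT_TEAM}

print(assemble_team({"payments", "latency"}))  # {'payments-oncall'}
print(assemble_team({"unknown"}))              # {'noc-triage'}
```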

4. Proactive Engineering Detection Rate

  • Context: Understanding issues even before they manifest as incidents ensures a platform’s reliability.
  • Our Metric: The percentage of potential issues engineering identified before they became incidents, measured against those reported externally.
  • Patterns & Impact: A low percentage (<80% for downtime-related incidents) indicates a reactive approach. High proactiveness, as evidenced by our journey assured platform reliability.
  • Measures for Improvement: Fine-tune alerting and monitoring, and maintain transparency and feedback loops with customer-facing teams. A sketch of how the rate itself can be computed follows this list.
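
The rate itself is easy to compute once each incident records who detected it first. A minimal sketch, assuming a hypothetical detected_by field on incident records:

```python
# Hypothetical incident records; detected_by indicates whether engineering/
# monitoring or an external party (merchant, support) reported the issue first.
incidents = [
    {"id": "INC-1", "detected_by": "engineering"},
    {"id": "INC-2", "detected_by": "engineering"},
    {"id": "INC-3", "detected_by": "merchant"},
]

proactive = sum(1 for i in incidents if i["detected_by"] == "engineering")
rate = proactive / len(incidents)
print(f"Proactive detection rate: {rate:.0%}")  # 67% here, below an 80% target
```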

5. Number of Critical False Positives

  • Context: False positives can drain the productivity and morale of the NOC team. They detract from real issues and can potentially desensitize the team to genuine threats.
  • Our Metric: At the outset, we grappled with an astounding 40% of critical alerts being false positives. Our relentless push brought this down to a mere 5%.
  • Antipatterns: Over-alerting can spread the NOC team too thin, with a risk of missing a genuine alert amid the noise.
  • Impact: Lowering the false-positive rate paves the way for scalable and effective automation; a high rate not only impedes automation but also compromises the quality of incident responses. Alert fatigue can cost fintech platforms like ours (comparable to Stripe, Plaid, or Square) dearly in both platform reliability and team morale. False positives in alerting might seem innocuous, but they slowly erode the efficiency of your response mechanism. A disciplined, data-driven approach, much like the one we practiced, can turn this around: it’s not just about the quantity of alerts but their quality, ensuring each alert is actionable, relevant, and steers the platform away from potential disruptions.
  • Measures for Improvement: We embraced a rigorous weekly analysis of all alerts and escalations. Each stage of the alert funnel was scrutinized to ensure that every alert served a genuine, preventative purpose against potential incidents. This consistent refinement not only brought down false positives but also sharpened our entire incident management strategy. A sketch of the kind of per-rule review we ran follows this list.
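
As a minimal sketch of such a weekly review, assuming each critical alert is labelled as actionable or false positive during triage (the rule names and fields are hypothetical):

```python
from collections import Counter

# Hypothetical week of critical alerts, each labelled during the weekly
# review as actionable or a false positive.
weekly_alerts = [
    {"rule": "payments_5xx_rate", "false_positive": False},
    {"rule": "db_connections_high", "false_positive": True},
    {"rule": "db_connections_high", "false_positive": True},
    {"rule": "latency_p99_spike", "false_positive": False},
]

totals = Counter(a["rule"] for a in weekly_alerts)
false_pos = Counter(a["rule"] for a in weekly_alerts if a["false_positive"])
rates = {rule: false_pos[rule] / total for rule, total in totals.items()}

# Rules with the highest false-positive rate are the first candidates for tuning.
print("False-positive rate per rule (highest first):")
for rule, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {rule}: {rate:.0%} ({false_pos[rule]}/{totals[rule]})")
```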

Conclusion

Metrics are more than mere numbers; they’re the compass guiding our path to excellence. In the fintech domain, serving millions of users, these metrics and our proactive steps have been instrumental in delivering a platform that users trust implicitly. Building isn’t enough; it’s about crafting with insight, dedication, and continuous learning.

#okr #incident #framework #noc #networkoperationscenter #uptime #outage #alerting #monitoring #management #programmanagement #devops