Incident Post-Mortem: Toronto Data Center Edge Switch Update Outage
Summary
Today our services experienced an unplanned partial outage affecting some customers in the Toronto region. The outage was triggered by an unscheduled edge (routing) switch update performed by our data center operators in Toronto, which introduced a misconfiguration and took the entire suite offline. Our servers automatically failed over to alternative data centers; however, after coming back online, a segment of customers' services failed to properly make or receive calls. The issue was resolved through manual intervention and system-specific troubleshooting. We apologize for the inconvenience and are committed to improving resilience to prevent future disruptions.
Impact
- Affected Users: Some customers in the Toronto region experienced intermittent connectivity and service disruptions. A subset of users relying on more advanced call-handling (VoIP) systems faced issues with making or receiving calls.
- Scope: No data loss occurred, but some users experienced temporary service unavailability or degraded performance, particularly with call-related and AI services.
Root Cause
The primary cause was an unannounced edge switch update by the data center operators, which resulted in a misconfiguration that disconnected the Toronto suite and affected parts of our network. The secondary issue, where some systems failed to handle calls after coming back online, stemmed from the complex interconnections that certain VoIP systems require. This system-specific issue persisted after the automatic failover and required targeted intervention.
Resolution
- Edge Switch Issue: Onsite technicians manually corrected the misconfiguration introduced by the update, restoring connectivity to the Toronto suite.
- Failover Success: Automatic failover to other data centers mitigated the majority of the outage impact, ensuring continuity for most services.
- Call System Issue: Engineers performed targeted troubleshooting on the affected system, resolving the issue.
Lessons Learned
- Vendor Communication: The lack of advance notification from the data center operator underscored the need for stricter communication protocols and enforcement with third-party providers. TELAIR posts its planned maintenance on our Status Page.
- System-Specific Failover Testing: While the automatic failover worked for most systems, the call functionality issue highlighted gaps in testing specific components in failover scenarios.
- Monitoring Granularity: The delay in identifying the call system issue suggests a need for more granular monitoring of individual system functionalities post-failover.
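To illustrate what more granular monitoring could look like, here is a minimal sketch of a post-failover probe that checks call-service functionality rather than just host reachability. The hostnames, port, and health endpoint are hypothetical placeholders, not our production configuration.

```python
"""Illustrative sketch only: a granular post-failover health probe.

Hostnames, ports, and the status endpoint below are hypothetical
placeholders, not production values.
"""
import socket
import urllib.request

# Hypothetical failover-site endpoints for the call-handling service.
SIP_HOST, SIP_PORT = "voip-failover.example.com", 5060
STATUS_URL = "https://voip-failover.example.com/health/calls"  # assumed endpoint


def sip_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Basic reachability: can we open a TCP connection to the SIP port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def call_service_healthy(url: str, timeout: float = 3.0) -> bool:
    """Functional check: does the call service itself report healthy?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    reachable = sip_port_reachable(SIP_HOST, SIP_PORT)
    healthy = call_service_healthy(STATUS_URL)
    # A host can be "up" while call handling is still broken, which is the
    # gap this incident exposed; alerting should key off the functional check.
    print(f"SIP port reachable: {reachable}; call service healthy: {healthy}")
```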
Action Items
To prevent recurrence and improve resilience:
- Strengthen Vendor SLAs: Update protocols with data center providers to enforce mandated advance notice for all planned maintenance.
- Enhance Failover Testing: Conduct comprehensive testing of all critical systems, including call-handling functionalities, in simulated failover scenarios to identify and address potential issues.
- Improve Monitoring: Implement more detailed monitoring for system-specific functionalities, such as call services, to enable faster detection of issues during failover.
- Simulation Drills: Schedule a series of simulated outage events over the next three months, beginning September 20, to validate failover processes and minimize impact in future incidents (a sketch of one such drill check follows this list).
- Resilience Upgrades: Explore additional redundancy for call-handling systems to ensure seamless operation during failover events.
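As a concrete illustration of what a simulation drill can validate, the sketch below pretends one region is offline and verifies that the remaining sites still pass basic call-signaling and API checks. The region names, hosts, and ports are hypothetical placeholders; the actual drills will exercise our production failover tooling.

```python
"""Illustrative sketch only: a simple drill harness for failover validation.

Region names and check targets are hypothetical placeholders.
"""
import socket

# Hypothetical per-region checks: (description, host, port).
CHECKS = {
    "toronto-primary": [
        ("SIP signaling", "voip.toronto.example.com", 5060),
        ("API gateway", "api.toronto.example.com", 443),
    ],
    "failover-site": [
        ("SIP signaling", "voip-failover.example.com", 5060),
        ("API gateway", "api-failover.example.com", 443),
    ],
}


def run_drill(skip_region: str) -> dict:
    """Simulate losing one region and verify the others still pass checks."""
    results = {}
    for region, checks in CHECKS.items():
        if region == skip_region:
            results[region] = "simulated outage (skipped)"
            continue
        failures = []
        for name, host, port in checks:
            try:
                with socket.create_connection((host, port), timeout=3.0):
                    pass
            except OSError:
                failures.append(name)
        results[region] = "pass" if not failures else f"failed: {failures}"
    return results


if __name__ == "__main__":
    # Drill: pretend the Toronto primary is offline and confirm the
    # failover site still serves call signaling and the API.
    for region, outcome in run_drill("toronto-primary").items():
        print(f"{region}: {outcome}")
```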
We sincerely apologize for the disruption, particularly for customers affected by the call functionality issue. Our team is dedicated to implementing these improvements to deliver a more reliable experience. For any questions or further assistance, please reach out to our support team.
Thank you for your understanding and continued trust.