Issues in our Toronto Data Center

Incident Report for TELAIR Status Page

Postmortem

Incident Post-Mortem: Toronto Data Center Edge Switch Update Outage

Summary

On September 15, 2025, our services experienced an unplanned partial outage affecting some customers in the Toronto region. The outage was triggered by an unscheduled routing (edge) switch update performed by our data center operators in Toronto, which caused a misconfiguration and took the entire Toronto suite offline. Our servers automatically failed over to alternative data centers successfully; however, after coming back online, a segment of customers' services could not properly make or receive calls. The issue was resolved through manual intervention and system-specific troubleshooting. We apologize for the inconvenience and are committed to improving resilience to prevent future disruptions.

Impact

  • Affected Users: Some customers in the Toronto region experienced intermittent connectivity and service disruptions. A subset of users relying on more advanced systems faced issues with making or receiving calls.
  • Scope: No data loss occurred, but some users experienced temporary service unavailability or degraded performance, particularly with call-related and AI services.

Root Cause

The primary cause was an unannounced edge switch update by the data center operators, which resulted in a misconfiguration that disconnected the Toronto suite and affected parts of our network. The secondary issue, in which some systems failed to handle calls, was due to the complex interconnections that some VoIP systems require. This system-specific issue persisted after the automatic failover and required targeted intervention.

Resolution

  • Edge Switch Issue: Onsite technicians manually corrected the misconfiguration, restoring connectivity to the Toronto suite.
  • Failover Success: Automatic failover to other data centers mitigated the majority of the outage impact, ensuring continuity for most services.
  • Call System Issue: Engineers performed targeted troubleshooting on the affected system, resolving the issue.

Lessons Learned

  • Vendor Communication: The lack of advance notification from the data center underscored the need for stricter communication protocols and enforcement with third-party providers. TELAIR posts its own planned maintenance on our Status Page.
  • System-Specific Failover Testing: While the automatic failover worked for most systems, the call functionality issue highlighted gaps in testing specific components in failover scenarios.
  • Monitoring Granularity: The delay in identifying the call system issue suggests a need for more granular monitoring of individual system functionalities post-failover.

Action Items

To prevent recurrence and improve resilience:

  • Strengthen Vendor SLAs: Update protocols with data center providers to further enforce mandated advance notice for all planned maintenance.
  • Enhance Failover Testing: Conduct comprehensive testing of all critical systems, including call-handling functionalities, in simulated failover scenarios to identify and address potential issues.
  • Improve Monitoring: Implement more detailed monitoring for system-specific functionalities, such as call services, to enable faster detection of issues during failover.
  • Simulation Drills: Schedule a series of simulated outage events over the next three months, beginning September 20, to validate failover processes and minimize impact in future incidents.
  • Resilience Upgrades: Explore additional redundancy for call-handling systems to ensure seamless operation during failover events.
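
As an illustration of the monitoring item above, the following is a minimal sketch of per-function health checks that could run after a failover. It is not our production tooling: the function and class names (run_post_failover_checks, degraded_functions, ProbeResult) and the stub probes are hypothetical, and a real probe would place a test call or send a test message rather than return a fixed value. The idea is only that each service function is checked separately, so a region that reports as "up" can still be flagged when calling is broken.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    function: str      # service function being checked, e.g. "inbound_calling"
    healthy: bool
    detail: str = ""


def run_post_failover_checks(probes: Dict[str, Callable[[], bool]]) -> List[ProbeResult]:
    """Run every probe and record one result per service function.

    A probe that raises is recorded as unhealthy, so a single broken
    check cannot hide the results of the others.
    """
    results = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
            detail = "" if ok else "probe returned failure"
        except Exception as exc:  # hypothetical probes may raise on timeouts
            ok, detail = False, f"probe raised: {exc!r}"
        results.append(ProbeResult(name, ok, detail))
    return results


def degraded_functions(results: List[ProbeResult]) -> List[str]:
    """Return the names of functions whose probe did not pass."""
    return [r.function for r in results if not r.healthy]


if __name__ == "__main__":
    # Stub probes standing in for real checks (assumption: real probes
    # would exercise each function end to end).
    probes = {
        "outbound_calling": lambda: True,
        "inbound_calling": lambda: False,
        "messaging": lambda: True,
        "pbx_portal": lambda: True,
    }
    print(degraded_functions(run_post_failover_checks(probes)))
```

Keeping one probe per function, rather than a single region-level check, is what would let an alert name the affected function directly.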

We sincerely apologize for the disruption, particularly for customers affected by the call functionality issue. Our team is dedicated to implementing these improvements to deliver a more reliable experience. For any questions or further assistance, please reach out to our support team.

Thank you for your understanding and continued trust.

Posted Sep 15, 2025 - 22:11 EDT

Resolved

We will continue to monitor for the next 24 hours, but we can confirm that no connectivity issues are being reported from our Toronto data center, so we are resolving this incident.

All services are 100% online and responding well.

We will post an RFO (Reason for Outage) on this incident page within 24 hours.
Posted Sep 15, 2025 - 12:54 EDT

Update

Continuing to monitor for stability. An additional, final sync is expected around 12:30 PM EDT, provided no connectivity issues are detected before then. Updates to follow.
Posted Sep 15, 2025 - 11:55 EDT

Update

Some connectivity is still being reported as intermittent. Updates to follow...
Posted Sep 15, 2025 - 11:21 EDT

Monitoring

All services restored and mitigated. We are monitoring all services. Updates to follow...
Posted Sep 15, 2025 - 11:00 EDT

Update

Services are recovering across our backup regions. Some clients in the Toronto region may still experience intermittent issues as services continue to restore. Once services are restored to the primary data center and our data center provider supplies an RFO (Reason for Outage), we will share it on this incident page.
Posted Sep 15, 2025 - 09:49 EDT

Update

Some services have already been mitigated. Awaiting mitigation in regions CA1 and CA2 for TRITON series systems. Updates to follow...
Posted Sep 15, 2025 - 08:33 EDT

Update

We are continuing to work on a fix for this issue.
Posted Sep 15, 2025 - 07:37 EDT

Identified

Our data center provider has reported network issues and confirmed that work has already begun. No ETA yet; updates to follow...
Posted Sep 15, 2025 - 06:59 EDT

Investigating

We are aware that many services are currently unavailable and are working with our data center provider.
Posted Sep 15, 2025 - 06:51 EDT
This incident affected: Voice (Outbound calling, Inbound calling, Messaging (SMS/MMS)), Web (PBX Portal), and Email.