CONNECTIVITY  Incident Report (Resolved)

Event Description: ACS001 Crash
Event Start Time: 2020-07-08 14:19 EST
Event End Time: 2020-07-08 14:22 EST
RFO Issue Date: 2020-07-09


Affected Services

Some phones registered to Atlanta lost registration. Devices configured with SRV or UDP failed over. Devices manually registered to core1-atl did not regain registration.
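
The difference between SRV-configured devices and devices pinned to a single host can be illustrated with a minimal sketch. It assumes a simplified reading of DNS SRV selection (lowest priority value first, then higher weight) and is not the firmware logic of any particular phone. The record values, the example.net hostnames, and the pick_registrar helper are hypothetical and used for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first (RFC 2782)
    weight: int    # simplified here: higher weight preferred within a priority
    port: int
    target: str


def pick_registrar(
    records: Iterable[SrvRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[SrvRecord]:
    """Return the first reachable record in priority order, or None."""
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight))
    for record in ordered:
        if is_reachable(record.target):
            return record
    return None


# Hypothetical SRV answer: a primary and a backup registrar.
records = [
    SrvRecord(priority=10, weight=50, port=5060, target="core1-atl.example.net"),
    SrvRecord(priority=20, weight=50, port=5060, target="core2.example.net"),
]

# Simulate the primary being unavailable during the crash.
down = {"core1-atl.example.net"}


def reachable(host: str) -> bool:
    return host not in down


# A device using SRV has a second target to fall back to.
srv_choice = pick_registrar(records, reachable)

# A device manually pinned to one host has no alternative.
pinned_choice = pick_registrar(records[:1], reachable)
```

In this sketch srv_choice resolves to the backup target while pinned_choice is None, which mirrors why manually registered devices had nowhere to fail over to.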

Event Summary

On July 8th, 2020, the ACS cluster crashed on three separate occasions, each causing a 30-second outage. Several phones lost registration and the ability to make or receive calls.

Event Timeline (All times 24-hour format, EST)

July 8th, 2020

  • 14:19 ACS cluster crashed
  • 14:19 NOC team was notified
  • 14:19 ACS cluster restored after 30 seconds
  • 14:19 ACS cluster crashed
  • 14:19 NOC team was notified
  • 14:19 ACS cluster restored after 30 seconds
  • 14:20 NOC team was notified
  • 14:20 First report identified by partner
  • 14:21 ACS restored, along with device registration
  • 14:22 Second report identified by partner
  • 15:39 Tier 3 & 4 OIT engineers began an investigation. A report was posted to Discord.
  • 15:40 ACS cluster crashed
  • 15:40 NOC team was notified
  • 15:40 ACS cluster restored after 30 seconds

Root Cause Analysis

While troubleshooting with our senior engineers, we determined that the crash occurred inside the RTP layer of the switch, specifically in an object called CRTPRelayTap. This suggests the crash was likely triggered during an audio tap used for audio monitoring.

The issue has been logged as a bug and will be corrected via a patch.

Future Preventative Action

The patch is undergoing testing and is scheduled to be installed on 2020-07-30 during our maintenance window.