CONNECTIVITY Incident Report (Resolved)
Event Description: ACS001 Crash
Event Start Time: 2019-11-22 11:42 PM EST
Event End Time: 2019-11-23 01:29 AM EST
RFO Issue Date: 2019-11-25
Phones registered to the Atlanta cluster lost registration. Devices configured with SRV or UDP failed over to other clusters. Devices configured for TCP, or manually registered to core1-atl, did not regain registration.
On November 22nd, 2019, at 23:42 EST, the ACS cluster began crashing repeatedly. Several phones lost registration and the ability to make or receive calls.
Event Timeline (All times 24 hour format, EST)
November 22nd, 2019
- 23:42 First crash reported by monitoring systems. Phones lost registration. Notice placed on partner server
- 23:43 Failover verified to SJE and NYJ clusters
November 23rd, 2019
- 00:00 Issue verified isolated to Atlanta servers
- 00:25 Rolled back 40.2 updates in case they were the cause of the issue
- 00:57 Services to Atlanta resumed and functional. Endpoints configured for UDP or SRV registered back to the Atlanta cluster. Cause still undetermined
- 01:29 Atlanta cluster remained online. All UDP and SRV phones remained registered. Some reports of call history not functioning. Investigation to continue in the morning
- 17:12 Call history page restored. All services functional
Root Cause Analysis
In troubleshooting with NetSapiens and our own senior engineers, we determined that malformed TCP packets were causing a crash in the TCP stack. We believe the packets originated from a single device, but more testing is needed. Normally this would not affect services. However, the repeated crashes and subsequent core dumps quickly filled the server's storage, and the full storage prevented normal functions from processing.
Future Preventative Action
While not a permanent fix, the decision was made to block TCP device registration. This affected less than 1% of our total registered devices. Devices that were configured for TCP must be reconfigured to use UDP. We are continuing to work with NetSapiens senior engineering staff to determine how a single errant device could crash the stack. Finally, we have instituted additional safeguards that immediately move core dump files off the server to prevent storage from filling again.
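The core-dump offload safeguard described above could be sketched along the following lines. This is a minimal illustration only: the `dump_dir`, `archive_dir`, and the `core*` filename pattern are assumptions for the example, not our production paths, and the archive directory is assumed to sit on a separate volume.

```python
import shutil
from pathlib import Path

def offload_core_dumps(dump_dir, archive_dir, pattern="core*"):
    """Move core dump files out of dump_dir into archive_dir (assumed
    to be a separate volume) so repeated crashes cannot fill the
    server's primary storage. Returns the paths of the moved files."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for dump in Path(dump_dir).glob(pattern):
        if dump.is_file():
            target = archive / dump.name
            # shutil.move works across filesystems, unlike os.rename
            shutil.move(str(dump), str(target))
            moved.append(target)
    return moved
```

In practice a script like this would run on a short interval (cron or a systemd timer) so dumps are cleared between crashes.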
Update 11/26/19: Worked with NS engineering to isolate the issue to TCP connections on SIP trunks only. SIP trunk TCP functionality was left disabled. TCP functionality for endpoints was restored and devices re-registered successfully. Systems have been stable since. NS will continue to follow up regarding SIP trunk TCP.