Email for life - Cantab.net

Service News

Undetected Outage [Resolved] + Investigation Results

Dear Members - for a period of about 3 hours [22:35-01:35], one of our servers failed without our monitoring systems detecting a problem. We have determined that this was due to simultaneous faults in our two independent monitoring systems. The first monitoring server watches for actively failed or unhealthy states, such as high CPU load or no running IMAP processes, and relies on data being reported to it by the server being monitored. In this case the server failed in such a way that no data at all was being sent to the monitoring server, so no unhealthy state was registered. Our second monitoring server performs active availability checks that do not rely on data being sent to it. However, it had a problem of its own running these otherwise very reliable checks. The combination meant our emergency staff were not alerted. We are making the necessary improvements to the monitoring systems to ensure this situation does not arise again. Apologies for the inconvenience caused, and thank you for your patience.
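
For technically minded members, the gap in the first kind of monitoring can be illustrated with a minimal sketch. This is an illustration only and not a description of our actual systems: the `Report` type, the `evaluate` function and both thresholds are hypothetical. It shows a monitor that treats a lack of recent data as a failure in its own right, rather than only inspecting the values a server reports.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values for illustration; not taken from any real configuration.
STALE_AFTER_SECONDS = 300.0
MAX_CPU_LOAD = 8.0


@dataclass
class Report:
    received_at: float   # time the report arrived, in seconds
    cpu_load: float      # load average reported by the server
    imap_processes: int  # number of running IMAP processes


def evaluate(last_report: Optional[Report], now: float) -> str:
    """Return an alert string, or "OK" if the server looks healthy."""
    # Silence is treated as a failure: a server that has stopped
    # reporting cannot tell us that it is unhealthy.
    if last_report is None or now - last_report.received_at > STALE_AFTER_SECONDS:
        return "ALERT: no data received"
    if last_report.imap_processes == 0:
        return "ALERT: no running IMAP processes"
    if last_report.cpu_load > MAX_CPU_LOAD:
        return "ALERT: high CPU load"
    return "OK"
```

A monitor that only performed the last two checks would stay silent once reports stopped arriving, which is the kind of blind spot described above.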

Originally Written: 25-Apr-2013 01:32, Last Updated: 06-Nov-2014 12:29
