News: Postmortem for event on November 16th
Servers: Sunfire, Pixel, Blizzard, Safari, Echo, Shadow, Lucy, Taylor, London, Longhorn
In the early hours of November 16th (US/Central), the above-mentioned servers all experienced performance issues lasting roughly 2-4 hours, leading up to complete failure and requiring reboots and emergency maintenance.
Some days before, code had been deployed with the simple purpose of cleaning out specific bounce emails that we did not want to send, but also did not want stuck in the queue for the full retry time. Specifically, these are Facebook notification emails and Google DMARC reports. These emails often fail in forwarding (because Google declines both when they are forwarded to Gmail), and we don't want them sitting in the queue. We also don't want to send bounce messages back to Facebook and Google for these emails, because they do not accept them. Left in the queue, they might mask more important issues, which is why we want to clean them out regularly.
The code which was deployed had two points of logic:
1. If the script isn't a cron job, add itself to cron.
2. Run the cleanup procedure.
A human error in the logic caused the script to add itself to the crontab every hour. It would then run at the top of every hour, once for each time it was listed in the crontab. This wasn't a big deal until it had added itself to the crontab a few thousand times, ultimately leading to a run of the crontab at the top of an hour that consumed all available memory on each of these servers, causing them to dip into swap space. While our monitoring did not report the servers as down, and they were technically online and functional to some degree, they were either incredibly slow or virtually inaccessible at various points. Some servers saw DirectAdmin fail to load; others saw webmail fail to load. One server, Blizzard, saw over 1000 emails stuck in the queue waiting for load to drop (these emails were processed fine afterward).
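For readers curious how this class of bug is usually avoided, the general technique is to make the "add to cron" step idempotent: check whether the entry is already present, and make sure it appears exactly once, before writing the crontab back. The following is only a minimal sketch of that idea, not the script we actually run. The function names and the example path are hypothetical, and it assumes the standard `crontab -l` and `crontab -` commands are available.

```python
import subprocess


def ensure_cron_entry(crontab_text: str, entry: str) -> str:
    """Return crontab text in which `entry` appears exactly once.

    Any existing copies of the entry are dropped and a single copy is
    appended, so calling this repeatedly never grows the crontab.
    """
    wanted = entry.strip()
    # Keep every line that is not a copy of our entry.
    kept = [line for line in crontab_text.splitlines() if line.strip() != wanted]
    kept.append(wanted)
    return "\n".join(kept) + "\n"


def install_cron_entry(entry: str) -> None:
    """Read the current user's crontab and rewrite it only if needed."""
    current = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    # `crontab -l` exits non-zero when the user has no crontab yet.
    existing = current.stdout if current.returncode == 0 else ""
    updated = ensure_cron_entry(existing, entry)
    if updated != existing:
        # `crontab -` replaces the crontab with what it reads on stdin.
        subprocess.run(["crontab", "-"], input=updated, text=True, check=True)


# Hypothetical usage; the schedule and path are illustrative only:
# install_cron_entry("0 * * * * /opt/example/clean_queue.py")
```

With a check like this, running the install step every hour, or several thousand times, leaves a single entry in the crontab, and it also collapses any duplicates that are already there.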
The lesson here is that one line of code can do a lot of damage, even on a seemingly innocent and simple task. It was resolved, and it won't be allowed to happen again.