I do have a Grafana alert set up for "the CPU has been slammed solid for more than an hour", but it turns out the logic for it was broken so the alert never got sent.
going through my metrics, I can see that my average power consumption on the server rack was elevated by roughly 2 kWh/day for the past two days (about 4 kWh in total), so this bug probably cost me about £1 in electricity.
Comments
Graham Sutherland / Polynomial
from what I can tell, the middleware bug is something to do with the contents of /dev changing during the execution of a cleanup script that runs periodically, which would explain why it's a rare edge-case.
looking through the logs it might've been an HBA hiccup, because it did complain about something on /dev/da1, but it's hard to line up the timing because I don't know exactly when the script started.
I just found the actual answer to this. /etc/periodic/security/ has two periodic scripts, 100.chksetuid and 110.neggrpperm.
by default (/etc/defaults/periodic.conf) these are enabled and configured to run daily. these scripts scan your system for setuid files and for files with negative group permissions, using `find`.
the problem is that this gets run *per jail* and if the jails mount large datasets it eats a ton of CPU time for several hours at a time.
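one possible mitigation (a sketch, not necessarily what was done here) is to switch the two checks off inside each affected jail by overriding the defaults in that jail's /etc/periodic.conf. the variable names below are the ones I'd expect to find in /etc/defaults/periodic.conf, so check them against your FreeBSD version first. this also assumes you're happy for the host's own daily run to be the only scan of those datasets.

```sh
# /etc/periodic.conf inside the jail (overrides /etc/defaults/periodic.conf)
# assumption: the host still runs these checks, so skipping them here loses little
security_status_chksetuid_enable="NO"    # skip 100.chksetuid in this jail
security_status_neggrpperm_enable="NO"   # skip 110.neggrpperm in this jail
```

if you'd rather keep the checks but run them less often, the matching `security_status_chksetuid_period` and `security_status_neggrpperm_period` settings should accept `weekly` or `monthly` in place of `daily`, again assuming your release's defaults file has them.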