Monitor HDD/SSD health #92
All hosts should periodically run filesystem scrubs, SMART tests, or similar checks to detect failing disks, ideally before they cause any data loss.
Any failing drive or filesystem should raise an alert to Drift (via e-mail, Matrix, etc.).
I've been having a look at smartd in nixpkgs, and it seems to be relatively straightforward to set up. However, it's currently built without its systemd integration upstream. Maybe we should dogfood a patch until we've been able to upstream it?
nixpkgs has a custom script for notifying via any combination of email, systembus-notify, wall and xmessage, but it's probably most reasonable to just use email. That means we need to set up MTAs on our NixOS machines, however.
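For reference, a minimal sketch of what the email-only smartd setup could look like in a NixOS module. The option names are the `services.smartd` options from nixpkgs as far as I know them; the sender and recipient addresses are placeholders I made up, and mail delivery still assumes a working sendmail-compatible MTA on the host.

```nix
{ ... }:
{
  services.smartd = {
    enable = true;

    # Let smartd scan for drives instead of listing devices by hand.
    autodetect = true;

    notifications = {
      mail = {
        enable = true;
        # Placeholder addresses (assumptions), replace with the real ones.
        sender = "smartd@example.org";
        recipient = "drift@example.org";
      };

      # Disable the other notification channels so only email is used.
      wall.enable = false;
      x11.enable = false;
      systembus-notify.enable = false;
    };
  };
}
```

Setting `services.smartd.notifications.test = true;` temporarily should make smartd send a test notification on startup, which is a cheap way to check that the MTA side actually delivers.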
Not sure about the state of smartd on Debian/FreeBSD, but we seem to already have some kind of cron jobs running there?
Alternatively, something like https://github.com/matusnovak/prometheus-smartctl could also be considered. It doesn't need to be mutually exclusive with the email notifications, but if we set up alerts in Grafana, the emails might become redundant.
The NixOS machines are now fine, but the Debian machines still need the mail notifications set up.