With any type of storage, failures and errors happen. Depending on the setup and their extent, such failures and errors might be anything from merely annoying to absolutely critical. However, they all have one thing in common -- they need to be properly reported so that counter-actions and fixups can happen.
What does properly mean here? The failures and errors need to be reported:
- in a visible place,
- in a clear, unambiguous format that can be understood by both humans and machines, allowing both manual and automated reactions,
- in a way that makes sure they are persistent and archived,
- together with timestamps that simplify (and sometimes even enable) analysis.
Ideally, there should be some common method of such reporting so that it's easy to get all the storage-related events and distinguish them from other events.
The current state is very suboptimal (to put it mildly). Every technology doing some kind of monitoring and reporting of storage events uses its own way (by default). MD RAID sends an email to root@localhost and logs messages to syslog, LVM (or, more precisely, dmeventd) logs messages to syslog, smartd logs messages to syslog, prints warnings to console(s) and is also listened to by udisks, which provides DBus signals for SMART events, etc. (no idea about Btrfs, pointers welcome!). So all the reports are in visible places, persistent and archived, and have timestamps. However, they are far from having an unambiguous format that can be understood by both humans and machines. MD RAID provides a way to run an executable on every event, which may be used for automation (so does dmeventd, but only for a very limited set of events).
There's again one common thing -- all the reports are logged. Nevertheless, it's nearly impossible to get all storage-related events and distinguish them from other events in the logs because they are more or less just free-form text.
It's obvious that to make getting all storage-related reports easy, they need to be in one place. That is already happening -- they are all being logged. However, it happens in a way that makes them hard to distinguish from other messages and basically impossible to automatically analyze and process further.
Fortunately, modern systems use journald for logging, which supports so-called structured logging -- logging where a log entry is not just a plain free-form text message, but a set of key-value pairs (with values potentially being arbitrary BLOBs). So a trivial solution would be to add an extra field TYPE=storage-event to every storage event being logged (with MESSAGE being the standard field for the free-form text) and then use that field to filter the logs for storage event reports. Similarly, other extra fields can be used to provide machine-readable information like DEVICE, STATE,... as well as tips for human administrators like links to documentation or man pages.
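As a rough sketch of the idea (using the python-systemd bindings; note that TYPE=storage-event is just the illustrative convention suggested above, not a standard journal field):

import systemd.journal

# log a storage event with extra machine-readable fields;
# the first argument becomes the standard MESSAGE field
systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     TYPE="storage-event",
                     DEVICE="/dev/sda",
                     STATE="failing")

All such entries could then be fetched later with e.g. journalctl TYPE=storage-event.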
Since this is a very common (use) case, and not only for storage-related event reporting, the journal has very nice builtin support for all this. There are a few predefined/well-known fields and a few automatically-computed fields added by journald itself, but the logging process can pass arbitrary extra fields of its own, even multiple key-value pairs with the same key (and thus multiple values for that key). And for the TYPE field it has a predefined key with the rather confusing name MESSAGE_ID, which is not an ID of the given message but an ID of the message's type/category/... The value of MESSAGE_ID is supposed to be a 128-bit UUID written as hexadecimal digits without any separators. A new such UUID can be created with:
# journalctl --new-id128
which also prints code snippets for C and Python to work with the generated UUID value.
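A new ID can also be generated programmatically -- a small sketch, assuming the python-systemd bindings are available (systemd.id128.randomize() returns a standard uuid.UUID object):

import systemd.id128

new_id = systemd.id128.randomize()  # a random uuid.UUID
print(new_id.hex)                   # 32 hex digits without separators,
                                    # i.e. the format expected in MESSAGE_ID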
Moreover, the journal also supports so-called catalogs. A catalog entry defines a verbose, email-like template that can be used to generate a nice message for a given MESSAGE_ID using the values from the log entry. For a storage-related event it could look like this:
-- 3183267b90074a4595e91daef0e01462
Subject: Storage event reported by @SOURCE@
Defined-by: storage-event-reporting
Documentation: man:storage-event-reporting(7)
Documentation: man:@SOURCE_MAN@

Event: @DEVICE@ is @STATE@: @DETAILS@.
Event-level: @PRIORITY@ (@PRIORITY_DESC@)

In case the device is failing or failed, or if the event level is
4 (warning) or lower (more significant), an immediate action may be
needed to prevent system malfunction!
where the first line is the MESSAGE_ID, and the Subject:, Defined-by: and Documentation: items (like email headers) are predefined and recommended. A concrete message can then look like this (output from journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462):
Mar 24 11:27:15 localhost.localdomain logger[28370]: SMART is reporting too many ECC failures on /dev/sda
-- Subject: Storage event reported by smartd
-- Defined-by: storage-event-reporting
-- Documentation: man:storage-event-reporting(7)
-- Documentation: man:smartd(8)
--
-- Event: /dev/sda is failing: Too many ECC failures.
-- Event-level: 4 (warning)
--
-- In case the device is failing or failed, or if the event level is
-- 4 (warning) or lower (more significant), an immediate action may be needed
-- to prevent system malfunction!
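To make the journal actually use such a template, it has to be installed as a catalog file and the binary catalog index has to be refreshed -- a sketch, with storage-event.catalog being an arbitrary (hypothetical) file name containing the entry above:

# cp storage-event.catalog /usr/lib/systemd/catalog/
# journalctl --update-catalog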
Automated processing would of course need the raw key-value pairs, but those are exactly what is stored in the journal -- the verbose message above is just generated on demand from the template. So, for example, udisks could easily monitor the journal, filter the messages based on the above MESSAGE_ID and use the key-value pairs to generate DBus signals (using the journal API for all of that, not text processing, of course). And those DBus signals could be connected to e.g. GNOME Shell to inform the user that something is going on (as already happens for warnings/errors coming from smartd and "forwarded" by udisks). Of course, not all messages need to be errors and failures; there are harmless and good things happening too, like a thin pool being grown on demand in a VG (with some free space) or a multipath path coming back "online".
Once the Storage Instantiation Daemon (SID) is implemented and running on systems, it can be used to get a better picture of what's affected by the event. It will have a tree of block devices with all their dependencies. So knowing there's a problem on e.g. /dev/sda, SID will be able to tell that it is, for example, actually a backing device for a LUKS mapping which is a member of an MD RAID device /dev/md127, which in turn is used as a PV in the VG system that has the LVs root, swap and home. This may sound like a crazy complex example, but it is actually a common scenario/setup.
This is how a storage event report can be sent to the journal from a shell:
$ printf 'MESSAGE_ID=3183267b90074a4595e91daef0e01462\nDETAILS=Too many ECC failures\nDEVICE=/dev/sda\nSTATE=failing\nMESSAGE=SMART is reporting too many ECC failures on /dev/sda\nSOURCE_MAN=smartd(8)\nSOURCE=smartd\nPRIORITY=4\nPRIORITY_DESC=warning' |logger --journald
and from Python:
import systemd.journal

systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     MESSAGE_ID="3183267b90074a4595e91daef0e01462",
                     DETAILS="Too many ECC failures",
                     DEVICE="/dev/sda",
                     STATE="failing",
                     SOURCE_MAN="smartd(8)",
                     SOURCE="smartd",
                     PRIORITY="4",
                     PRIORITY_DESC="warning")
To get all such reports later:
$ journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462
or from Python:
reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for _entry in reader:
    print(reader.get_catalog())  # prints the generated string
To get the actual fields in Python:
reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for entry in reader:
    print(entry)  # entry is a dict
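The same raw fields can also be dumped from the shell in a machine-readable form, for example as JSON (one JSON object per log entry):

$ journalctl -o json MESSAGE_ID=3183267b90074a4595e91daef0e01462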
To watch the journal for changes and process them:
state = reader.wait(-1)  # -1 means no timeout, block until the journal changes
while state > 0:  # state < 0 means error
    # state could also be systemd.journal.INVALIDATE -> a new Reader() should be created
    if state == systemd.journal.APPEND:
        print(reader.get_next())
    state = reader.wait(-1)
And the best thing about the above Python examples is that C code doing the same looks almost the same -- even the C API provides these high-level functions and a single "object" to operate on.