With any type of storage, failures and errors happen. Depending on the setup and their extent, such failures and errors might be anything from merely annoying to absolutely critical. However, they all have one thing in common -- they need to be properly reported so that counter-actions and fixups can happen.
What does properly mean here? The failures and errors need to be reported:
- in a visible place,
- in a clear, unambiguous format that can be understood by both humans and machines, allowing both manual and automated reactions,
- in a way that makes sure they are persistent and archived,
- together with timestamps that simplify (and sometimes even enable) analysis.
Ideally, there should be some common method of such reporting so that it's easy to get all the storage-related events and distinguish them from other events.
The current state is very suboptimal (to put it mildly). Every technology doing some kind of monitoring and reporting of storage events uses its own way (by default). MD RAID sends an email to root@localhost and logs messages to syslog, LVM (or, more precisely, dmeventd) logs messages to syslog, smartd logs messages to syslog, prints warnings to console(s) and is also listened to by udisks, which provides DBus signals for SMART events, etc. (no idea about Btrfs, pointers welcome!). So all the reports are in visible places, persistent and archived, and have timestamps. However, they are far from having an unambiguous format that can be understood by both humans and machines. MD RAID provides a way to run an executable on every event, which may be used for automation (so does dmeventd, but only for a very limited set of events).
There's again one common thing -- all the reports are logged. Nevertheless, it's nearly impossible to get all storage-related events and distinguish them from other events in the logs because they are more or less just free-form text.
It's obvious that to make getting all storage-related reports easy, they need to be in one place. That is already happening -- they are all being logged. However, it happens in a way that makes them hard to distinguish from other messages and basically impossible to automatically analyze and process further.
Fortunately, modern systems use journald for logging, which supports so-called structured logging -- logging where a log entry is not just a plain free-form text message, but a set of key-value pairs (with values potentially being arbitrary BLOBs). So a trivial solution would be to add an extra field TYPE=storage-event to every storage event being logged (with MESSAGE being the standard field for the free-form text) and then use that field to filter the logs for storage event reports. Similarly, other extra fields can be used to provide machine-readable information like DEVICE, STATE,... as well as tips for human administrators like links to documentation or man pages.
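As a rough sketch of the idea (using the python-systemd bindings; note that TYPE=storage-event is just the illustrative convention suggested above, not a standard journal field):

import systemd.journal

# log a storage event with extra machine-readable fields;
# the first argument becomes the standard MESSAGE field
systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     TYPE="storage-event",
                     DEVICE="/dev/sda",
                     STATE="failing")

All such entries could then be fetched later with e.g. journalctl TYPE=storage-event.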
Since this is a very common (use) case, and not only for storage-related event reporting, the journal has very nice builtin support for all this. There are a few predefined/well-known fields and a few automatically-computed fields added by journald itself, but the logging process can pass arbitrary extra fields of its own, even multiple key-value pairs with the same key (and thus multiple values for that key). And for the TYPE field it has a predefined key with the rather confusing name MESSAGE_ID, which is not an ID of the given message but an ID of the message's type/category/... The value of MESSAGE_ID is supposed to be a 128-bit UUID written as hexadecimal digits without any separators. A new such UUID can be created with:
# journalctl --new-id128
which also prints code snippets for C and Python to work with the generated UUID value.
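A new ID can also be generated programmatically -- a small sketch, assuming the python-systemd bindings are available (systemd.id128.randomize() returns a standard uuid.UUID object):

import systemd.id128

new_id = systemd.id128.randomize()  # a random uuid.UUID
print(new_id.hex)                   # 32 hex digits without separators,
                                    # i.e. the format expected in MESSAGE_ID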
Moreover, the journal also supports so-called catalogs. A catalog entry defines a verbose, email-like template that can be used to generate a nice message for a given MESSAGE_ID using the values from the log entry. For a storage-related event it could look like this:
-- 3183267b90074a4595e91daef0e01462
Subject: Storage event reported by @SOURCE@
Defined-by: storage-event-reporting
Documentation: man:storage-event-reporting(7)
Documentation: man:@SOURCE_MAN@

Event: @DEVICE@ is @STATE@: @DETAILS@.
Event-level: @PRIORITY@ (@PRIORITY_DESC@)

In case the device is failing or failed, or if the event level is
4 (warning) or lower (more significant), an immediate action may be
needed to prevent system malfunction!
where the first line is the MESSAGE_ID, and the Subject:, Defined-by: and Documentation: items (like email headers) are predefined and recommended. A concrete message can then look like this (output from journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462):
Mar 24 11:27:15 localhost.localdomain logger[28370]: SMART is reporting too many ECC failures on /dev/sda
-- Subject: Storage event reported by smartd
-- Defined-by: storage-event-reporting
-- Documentation: man:storage-event-reporting(7)
-- Documentation: man:smartd(8)
--
-- Event: /dev/sda is failing: Too many ECC failures.
-- Event-level: 4 (warning)
--
-- In case the device is failing or failed, or if the event level is
-- 4 (warning) or lower (more significant), an immediate action may be needed
-- to prevent system malfunction!
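To make the journal actually use such a template, it has to be installed as a catalog file and the binary catalog index has to be refreshed -- a sketch, with storage-event.catalog being an arbitrary (hypothetical) file name containing the entry above:

# cp storage-event.catalog /usr/lib/systemd/catalog/
# journalctl --update-catalog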
Automated processing would of course need the raw key-value pairs, but those are exactly what is stored in the journal -- the verbose message above is just generated on demand from the template. So, for example, udisks could easily monitor the journal, filter the messages based on the above MESSAGE_ID and use the key-value pairs to generate DBus signals (using the journal API for all of that, not text processing, of course). And those DBus signals could be connected to e.g. GNOME Shell to inform the user that something is going on (as already happens for warnings/errors coming from smartd and "forwarded" by udisks). Of course, not all messages need to be errors and failures; there are harmless and good things happening too, like a thin pool being grown on demand in a VG (with some free space) or a multipath path coming back "online".
Once the Storage Instantiation Daemon (SID) is implemented and running on systems, it can be used to get a better picture of what's affected by the event. It will have a tree of block devices with all their dependencies. So knowing there's a problem on e.g. /dev/sda, SID will be able to tell that it is, for example, actually a backing device for a LUKS mapping which is a member of an MD RAID device /dev/md127, which in turn is used as a PV in the VG system that has the LVs root, swap and home. This may sound like a crazy complex example, but it is actually a common scenario/setup.
This is how a storage event report can be sent to the journal from a shell:
$ printf 'MESSAGE_ID=3183267b90074a4595e91daef0e01462\nDETAILS=Too many ECC failures\nDEVICE=/dev/sda\nSTATE=failing\nMESSAGE=SMART is reporting too many ECC failures on /dev/sda\nSOURCE_MAN=smartd(8)\nSOURCE=smartd\nPRIORITY=4\nPRIORITY_DESC=warning' |logger --journald
and from Python:
import systemd.journal

systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     MESSAGE_ID="3183267b90074a4595e91daef0e01462",
                     DETAILS="Too many ECC failures",
                     DEVICE="/dev/sda",
                     STATE="failing",
                     SOURCE_MAN="smartd(8)",
                     SOURCE="smartd",
                     PRIORITY="4",
                     PRIORITY_DESC="warning")
To get all such reports later:
$ journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462
or from Python:
reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for _entry in reader:
    print(reader.get_catalog())  # prints the generated string
To get the actual fields in Python:
reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for entry in reader:
    print(entry)  # entry is a dict
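The same raw fields can also be dumped from the shell in a machine-readable form, for example as JSON (one JSON object per log entry):

$ journalctl -o json MESSAGE_ID=3183267b90074a4595e91daef0e01462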
To watch the journal for changes and process them:
state = reader.wait(-1)  # -1 means no timeout, block until the journal changes
while state > 0:  # state < 0 means error
    # state could also be systemd.journal.INVALIDATE -> a new Reader() should be created
    if state == systemd.journal.APPEND:
        print(reader.get_next())
    state = reader.wait(-1)
And the best thing about the above Python examples is that C code doing the same looks almost the same -- even the C API provides these high-level functions and a single "object" to operate on.