Reporting and monitoring storage events

With any type of storage, failures and errors happen. Depending on the setup and their extent, such failures and errors might be anything from merely annoying to absolutely critical. However, they all have one thing in common -- they need to be properly reported so that counter-actions and fixups can happen.

What does "properly" mean here? The failures and errors need to be reported:

  • in a visible place,
  • in a clear unambiguous format that can be understood by both humans and machines to allow both manual and automated reactions,
  • in a way making sure they are persistent and archived,
  • together with timestamps that simplify (and sometimes even enable) analysis.

Ideally, there should be some common method of such reporting so that it's easy to get all the storage-related events and distinguish them from other events.

Current situation

Is very suboptimal (to put it mildly). Every technology doing some kind of monitoring and reporting of storage events uses its own way (by default). MD RAID sends an email to root@localhost and logs messages to syslog, LVM (or dmeventd, more precisely) logs messages to syslog, smartd logs messages to syslog, prints warnings to console(s) and is also listened to by udisks, which provides DBus signals for SMART events, etc. (no idea about Btrfs, pointers welcome!). So all the reports are in visible places, persistent and archived, and have timestamps. However, they are far from having an unambiguous format that can be understood by both humans and machines. MD RAID provides a way to run an executable on every event, which may be used for automation (so does dmeventd, but only for a very limited set of events).

There's again one common thing -- all the reports are logged. Nevertheless, it's nearly impossible to get all storage-related events and distinguish them from other events in the logs because they are more or less just free-form text.

How to improve the current situation?

It's obvious that to make getting all storage-related reports easy, they need to be in one place. Which is already happening -- they are all being logged. However, in a way that makes them hard to distinguish from other messages and basically impossible to further automatically analyze and process.

Fortunately, modern systems are using journald for logging, which supports the so-called structured logging. In other words, logging where a log entry is not just a plain free-form text message, but a set of key-value pairs (with values potentially being arbitrary BLOBs). So a trivial solution would be to add an extra field TYPE=storage-event to all messages for storage events being logged (with MESSAGE being the standard field for the free-form text) and then use that field to filter the logs for storage event reports. And similarly, other extra fields can be used to provide machine-readable information like DEVICE, STATE,... as well as tips for human administrators like links to documentation or man pages.
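
Just to illustrate the idea (the TYPE=storage-event field here is only the hypothetical example from the paragraph above, not an established convention), sending and filtering such a structured entry with the python-systemd bindings could look roughly like this:

import systemd.journal

# log a storage event with extra machine-readable fields attached
systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda",
                     TYPE="storage-event", DEVICE="/dev/sda", STATE="failing")

# later, pick out exactly these entries from the journal
reader = systemd.journal.Reader()
reader.add_match(TYPE="storage-event")
for entry in reader:
    print(entry["DEVICE"], entry["STATE"], entry["MESSAGE"])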

Since this is a very common (use) case not only for storage-related event reporting, journal has very nice built-in support for all this. There are a few predefined/well-known fields and a few automatically-computed fields added by journald itself, but the logging process can pass arbitrary extra fields on its own, even with multiple key-value pairs having the same key (and thus multiple values for that key). And for the TYPE field it has a predefined key with the rather confusing name MESSAGE_ID, which is not an ID of the given message but an ID of the message's type/category/... The value of MESSAGE_ID is supposed to be a 128-bit UUID written as hexadecimal digits without any separators. A new such UUID can be created with:

$ journalctl --new-id128

which also returns code chunks for C and Python to work with the provided UUID value.
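
Presumably the same can also be done from Python; a minimal sketch, assuming the systemd.id128 module from the python-systemd bindings is available:

import systemd.id128

# generate a new random 128-bit ID and print it as hex digits without separators
print(systemd.id128.randomize().hex)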

Moreover, journal also supports the so-called catalogs. A catalog entry defines a verbose, email-like template that can be used to generate a nice message for a given MESSAGE_ID using the values from the log entry. For a storage-related event it could look like this:

-- 3183267b90074a4595e91daef0e01462
Subject: Storage event reported by @SOURCE@
Defined-by: storage-event-reporting
Documentation: man:storage-event-reporting(7)
Documentation: man:@SOURCE_MAN@

Event: @DEVICE@ is @STATE@: @DETAILS@.
Event-level: @PRIORITY@ (@PRIORITY_DESC@)

In case the device is failing or failed, or if the event level is
4 (warning) or lower (more significant), an immediate action may be
needed to prevent system malfunction!

where the first line is the MESSAGE_ID and the Subject:, Defined-by: and Documentation: items (like email headers) are predefined and recommended. A concrete message can then look like this (output from journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462):

Mar 24 11:27:15 localhost.localdomain logger[28370]: SMART is reporting too many ECC failures on /dev/sda
-- Subject: Storage event reported by smartd
-- Defined-by: storage-event-reporting
-- Documentation: man:storage-event-reporting(7)
-- Documentation: man:smartd(8)
--
-- Event: /dev/sda is failing: Too many ECC failures.
-- Event-level: 4 (warning)
--
-- In case the device is failing or failed, or if the event level is
-- 4 (warning) or lower (more significant), an immediate action may be needed
-- to prevent system malfunction!

The automated processing would of course need the raw key-value pairs, but those are exactly what is stored in the journal -- the verbose message above is just generated on demand based on the template. So, for example, udisks could easily just monitor the journal, filtering the messages based on the above MESSAGE_ID, and use the key-value pairs to generate DBus signals (of course using the journal API for all of this, not custom text processing). And those DBus signals could be connected to e.g. GNOME Shell to inform the user that something is going on (as is now already happening for warnings/errors coming from smartd and "forwarded" by udisks). Of course, not all messages need to be errors and failures; there are harmless and good things happening too, like a thin pool being grown on demand in a VG (with some free space) or a multipath path coming back "online".
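
A rough sketch of what such a monitor could look like with the Python journal API (the emit_signal() function below is purely hypothetical, standing in for whatever udisks or another daemon would actually do with the raw fields):

import systemd.journal

def emit_signal(entry):
    # hypothetical placeholder for e.g. emitting a DBus signal carrying the raw fields
    print(entry.get("DEVICE"), entry.get("STATE"), entry.get("DETAILS"))

reader = systemd.journal.Reader()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
reader.seek_tail()       # only new events are interesting
reader.get_previous()    # position the reader right before the end of the journal

while True:
    if reader.wait(-1) == systemd.journal.APPEND:
        for entry in reader:   # iterate over the newly appended entries
            emit_signal(entry)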

Once the Storage Instantiation Daemon (SID) is implemented and running on systems, it can be used to get a better picture of what's affected by the event. It will have a tree of block devices with all their dependencies. So, knowing there's a problem on e.g. /dev/sda, SID will be able to tell that it is, for example, actually a backing device for a LUKS mapping which is a member of an MD RAID device /dev/md127 which in turn is being used as a PV in the VG system that has the LVs root, swap and home. This may sound like a crazy complex example, but it is actually a common scenario/setup.

Examples

This is how a storage event report can be sent to journal from shell:

$ printf 'MESSAGE_ID=3183267b90074a4595e91daef0e01462\nDETAILS=Too many ECC failures\nDEVICE=/dev/sda\nSTATE=failing\nMESSAGE=SMART is reporting too many ECC failures on /dev/sda\nSOURCE_MAN=smartd(8)\nSOURCE=smartd\nPRIORITY=4\nPRIORITY_DESC=warning' | logger --journald

and from Python:

import systemd.journal
systemd.journal.send("SMART is reporting too many ECC failures on /dev/sda", MESSAGE_ID="3183267b90074a4595e91daef0e01462", DETAILS="Too many ECC failures", DEVICE="/dev/sda", STATE="failing",
                     SOURCE_MAN="smartd(8)", SOURCE="smartd", PRIORITY="4", PRIORITY_DESC="warning")

To get all such reports later:

$ journalctl -xe MESSAGE_ID=3183267b90074a4595e91daef0e01462

or from Python:

reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for _entry in reader:
    print(reader.get_catalog())  # prints the generated string

To get the actual fields in Python:

reader = systemd.journal.Reader()
reader.this_boot()
reader.add_match(MESSAGE_ID="3183267b90074a4595e91daef0e01462")
for entry in reader:
    print(entry)  # entry is a dict

To watch the journal for changes and process them:

state = reader.wait(-1)  # block until something happens in the journal
while state > 0:  # NOP (0) would end the loop; errors raise an exception
    # state could also be INVALIDATE -> a new Reader() should be created
    if state == systemd.journal.APPEND:
        print(reader.get_next())  # the newly appended entry (a dict)
    state = reader.wait(-1)

And the best thing about the above Python examples is that C code doing the same looks almost the same -- even the C API provides these high-level functions and a single "object" to operate on.

@vpodzime (Author)

A short section about how SID can be used with the above described stuff needs to be added.

@tasleson

This is some stuff I sent out a while back when others were looking at alerts

Alerts (end user perspective)

  • Very few, if any, 'false' alerts, to prevent administrators from
    disabling them
  • One place to configure the system for alerts and have it work for all
    monitored software & hardware components
  • Avoid the daemon needed for every software component and a different
    configuration for every component
  • Minimal resource footprint with limited dependency requirements
  • Reliable
  • Configurable delivery of alerts via multiple different transports
    • SNMP Traps
    • Multiple emails & SMS (via email)
    • tweet
    • external command execution (allows end user to write own script/app)
  • Ability to update the configuration easily across a large number of
    systems, i.e. work with existing admin tools
  • Easy to configure with or without a GUI
  • Ability to check status of other alert daemons (e.g. heartbeats) to
    ensure the alert service is indeed operational
  • Component agnostic, hardware (e.g. Disk, RAID HBA, SCSI Enclosure
    Services, IPMI, multipath) and software (e.g. RAID).

Alerts (design possibilities)

  • Daemon which can receive/process alerts via:
    • syslog/journald filtering
    • udev
    • API
      • Via incoming dbus calls
      • Supplied command line tool so script writers can create own alerts
      • C shared library interface so other components could call it
      • Ability for plugin or some other mechanism so daemon can poll other
        components for status, thus relieving them of the need to have
        their own daemon
    • SCSI UA (via udev)

Deliver alerts:

  • DBUS signal so other components could get notified
  • SNMP/Email/SMS & other

Issues pertaining to this:

https://bugzilla.redhat.com/show_bug.cgi?id=838957
https://bugzilla.redhat.com/show_bug.cgi?id=832616

@vpodzime (Author)

A short section about how SID can be used with the above described stuff needs to be added.

Done.

@vpodzime (Author)

This is some stuff I sent out a while back when others were looking at alerts

Thanks! I'll go through the points and take them into account for the design.

@vpodzime (Author)

Added some examples demonstrating the API.

M4rtinK commented Mar 28, 2017

they need to be properly reported so that counter-actions and fixups can happen.
Sounds a bit weird to me. Maybe:
they need to be properly reported so that both software and the user can react and fix the issue.

M4rtinK commented Mar 28, 2017

in a clear unambiguous format that can be understood by both humans and machines to allow both manual and automated reactions

I think just reaction sounds more natural:

in a clear unambiguous format that can be understood by both humans and machines to allow both manual and automated reaction

M4rtinK commented Mar 28, 2017

in a way making sure they are persistent and archived,

Maybe:

in such a way to make sure they are persistently archived,

M4rtinK commented Mar 28, 2017

Ideally, there should be some common method of such reporting so that it's easy to get all the storage-related events and distinguish them from other events.

I would add all at the end:

Ideally, there should be some common method of such reporting so that it's easy to get all the storage-related events and distinguish them from all other events.

M4rtinK commented Mar 28, 2017

There's again one common thing -- all of reports all logged.

I would add the:

There's again one common thing -- all the reports all logged.

M4rtinK commented Mar 28, 2017

Fortunately, modern systems are using journald for logging which supports so called structured logging.

I would add the:

Fortunately, modern systems are using journald for logging which supports the so called structured logging.

M4rtinK commented Mar 28, 2017

Moreover, journal also supports so called catalogs.

Same as before:

Moreover, journal also supports the so called catalogs.

M4rtinK commented Mar 28, 2017

using the journal API for all that not text processing, of course

I would reword this a bit

of course using the journal API for all of this, not custom text processing

M4rtinK commented Mar 28, 2017

gnome-shell to inform user that something is going on

Looks like the official capitalization is GNOME Shell:

GNOME Shell to inform user that something is going on

M4rtinK commented Mar 28, 2017

Once the Storage Instantiation Daemon is implemented and running on systems it can be used to get a better picture of what's affected by the event.

I think it would be good to link to some resources about SID from there, same thing probably for some other projects the intended audience might want to learn about (of course not mainstream stuff like GNOME Shell).

M4rtinK commented Mar 28, 2017

Which may sound like a crazy complex example, but is actually a common scenario/setup.

I would rather use This:

This may sound like a crazy complex example, but is actually a common scenario/setup.

M4rtinK commented Mar 28, 2017

The automated processing would of course need the raw key-value pairs, but it's those what is stored in the journal of course. The above verbose message is just generated on demand based on the template.

Sounds a bit confusing to me, maybe rather (if correct):

The automated processing would just need the raw key-value pairs, which are actually the only thing stored in Journal - the verbose message above is just generated on demand based on the template.

M4rtinK commented Mar 28, 2017

And that's it from me. :)

@dwlehman

It looks like the journal already has what we need, minus two pieces:

  1. DBus API to send signals, manage filters, &c
  2. Lots of patches to various software to use structured logging as an event reporting mechanism

Is this accurate?

vpodzime (Author) commented Apr 4, 2017

I think it would be good to link to some resources about SID from there, same thing probably for some other projects the intended audience might want to learn about (of course not mainstream stuff like GNOME Shell).

Sure, once there are some (about SID). Definitely a good point for a future blog post based on this.

vpodzime (Author) commented Apr 4, 2017

And that's it from me. :)

Thanks!

vpodzime (Author) commented Apr 4, 2017

Is this accurate?

Exactly.

  2. Lots of patches to various software to use structured logging as an event reporting mechanism

Yes, but in order to get the data for further processing we need it structured (unless we want to parse it from the free-form text, of course). And the journal API is simple yet powerful and versatile, available in C, Python,... on all systems that have systemd, and easy to replace with plain syslog where it's not available. So I believe it is a better choice than any new API for event reporting we could come up with.
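
Just to illustrate that last point, a minimal sketch of such a fallback (the report() helper is purely illustrative, not an existing API):

try:
    from systemd import journal
    def report(message, **fields):
        # structured logging via the journal API
        journal.send(message, **fields)
except ImportError:
    import syslog
    def report(message, **fields):
        # fallback: no structured fields, so flatten them into the free-form text
        extras = ", ".join("%s=%s" % item for item in sorted(fields.items()))
        syslog.syslog(syslog.LOG_WARNING, "%s (%s)" % (message, extras))

report("SMART is reporting too many ECC failures on /dev/sda",
       MESSAGE_ID="3183267b90074a4595e91daef0e01462", DEVICE="/dev/sda", STATE="failing")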

  1. DBus API to send signals, manage filters, &c

I think this should be the second step. @tasleson also has it in his overview. But I think we first need a single source of the data (some "sink") which can then be easily watched by a daemon propagating those events to DBus, ZeroMQ interface or wherever else. And again the journal API is great for any tool that doesn't want to bother with DBus, zmq or anything like that.

Does the above make sense @dwlehman?

dwlehman commented Apr 4, 2017

Yep, makes sense to me.
