Riebart/syslog_levels.md

## syslog_levels.md

      
Syslog logging levels and parameters

All logs

All logs should contain the following information, preferably encoded in JSON so that it is easily machine-parsable as well as human-readable. Graylog can parse JSON, so encoding your fields that way makes it easy to alert and filter on individual message components.

Message: The message should be a plaintext description of the event, optionally including an application-unique code.
Impact: This field should describe the impact, if any (there is no impact for debug and informational events), on the state, output, and resiliency of the application. It should be short (one sentence), but contain enough information for someone not familiar with the application to triage the event.
Correction: This field should describe the corrective action, if any, that could or should be taken to resolve the impacts caused by the event (no corrective action is required for debug, informational, and notice level events). It should be short (at most two sentences) and should reference additional materials (wiki/documentation) if further details are necessary.
Supporting information: Any necessary supporting information should be included, such as error objects from Python or internal program state (for debug events, and for events whose corrective action may depend on the specific runtime state at the time of the event).
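
The field rules above can be checked mechanically before an event is logged. The following is a minimal sketch, not part of any library: the names `check_fields`, `INFORMATIONAL`, and `OPTIONAL` are hypothetical, and requiring both impact and correction at warning level and above is an assumption drawn from the level descriptions in this document.

```python
# Levels that carry no impact and no correction, per the field descriptions.
INFORMATIONAL = {"debug", "info"}
# Notice events may have a minor impact and do not necessarily need a correction.
OPTIONAL = {"notice"}


def check_fields(level, message, impact=None, correction=None):
    """Return a list of problems with an event's fields (empty if none)."""
    problems = []
    if not message:
        problems.append("message is missing")
    if level in INFORMATIONAL:
        if impact is not None:
            problems.append("impact should be None")
        if correction is not None:
            problems.append("correction should be None")
    elif level not in OPTIONAL:
        # Assumption: warning and above must state both impact and correction.
        if not impact:
            problems.append("impact is missing")
        if not correction:
            problems.append("correction is missing")
    return problems
```

A caller might refuse to emit an event, or log a debug note, whenever the returned list is non-empty.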

Example

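import json
import subprocess
import syslog

try:
    subprocess.check_output(["sinfo"])  # assumed call; the original shows only the handler
except subprocess.CalledProcessError as e:
    # First element: message/impact/correction. Second: supporting information.
    syslog.syslog(syslog.LOG_CRIT, json.dumps(
        [ { 'message': "Unable to list SLURM partitions",
            'impact': "No attempts are, or will be, made to start any workers.",
            'correction': "See supporting information for why the sinfo call failed." },
          [ e.__dict__ ]
        ], default=str))  # default=str keeps bytes output serializable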

***

syslog.syslog(syslog.LOG_INFO, json.dumps(
    [ { 'message': "Successfully started worker " + reason,
        'impact': None,
        'correction': None },
      [ { 'partition': p, 'worker': w, 'id': id,
          'hosts_up': num_hosts, 'jobs_per_worker': jobs_per_worker,
          'avg_seconds_per_job': sec_per_job, 'descriptions': descs,
          'min_id': None if ids == [] else min(ids),
          'max_id': None if ids == [] else max(ids),
          'on_demand': None if pargs.on_demand != p else (pargs.on_demand + " / " + pargs.on_demand_reason) } ]
    ]))
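
Both examples repeat the same payload layout: a list whose first element holds the message, impact, and correction, and whose second element holds the supporting information. A small helper could keep that layout consistent. This is a sketch only: `PRIORITIES`, `format_event`, and `log_event` are hypothetical names, while the `syslog.LOG_*` constants come from Python's standard `syslog` module.

```python
import json
import syslog

# Map the level names used in this document to Python's syslog priorities.
PRIORITIES = {
    "debug": syslog.LOG_DEBUG, "info": syslog.LOG_INFO,
    "notice": syslog.LOG_NOTICE, "warning": syslog.LOG_WARNING,
    "err": syslog.LOG_ERR, "crit": syslog.LOG_CRIT,
    "alert": syslog.LOG_ALERT, "emerg": syslog.LOG_EMERG,
}


def format_event(message, impact=None, correction=None, supporting=None):
    """Build the JSON payload in the [fields, [supporting]] layout."""
    fields = {"message": message, "impact": impact, "correction": correction}
    # default=str is a fallback for values json cannot encode natively.
    return json.dumps([fields, [supporting or {}]], default=str)


def log_event(level, message, **kwargs):
    """Send the formatted event to syslog at the named level."""
    syslog.syslog(PRIORITIES[level], format_event(message, **kwargs))
```

For instance, `log_event("info", "Successfully started worker", supporting={"partition": p})` would produce a record shaped like the second example.
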
Debug (debug)

The most detailed informational level, containing program state at developer-identified milestones. Debug logs should contain information that is expected to be useful in forensic situations and when diagnosing performance or runtime issues post-hoc.
Debug level messages have no impact or correction information, but will likely have supporting information.
Verbosity should not be a limiting factor, as debug messages pass through fewer intelligence filters than other message levels.
Examples


Complete client/response headers or messages in web server logs.
Query durations on a database server
Complete URIs for every API call in an application.

Informational (info)

Informational messages are the normal logging level, and should contain major application milestones and normal runtime operations.
Informational messages have no impact or correction, and minimal supporting information; they should be brief and succinct.
Verbosity should be such that a person watching an application's output at this level can follow and keep up with the logs.
Examples


Digested HTTP logs from a web server including client, URI, and select header fields (the standard for most web servers)
Client connections to a database server
Netflix: When a user starts watching a video
Successfully retrieved a file from a website.

Notice (notice)

Notice level events are events that do not necessarily require a corrective action, but may have some minor impact. Notice level events can occur when the application performs some action due to input, or lack thereof (such as timing out idle sessions).
Notice level events may be precursors to more severe events, and as such should be less verbose than informational events.
Examples


An attempt to connect to a remote server failed, but will be retried more than once.
Session timeout events
Transfer (network, disk, etc...) rate, or processing rate, or queue length, has crossed some threshold that may cause tasks to run for longer than intended.
A process failed to perform an operation using its preferred method, but was able to succeed using an alternate method (such as failing to retrieve a file over HTTPS, but being able to retrieve it over HTTP).
Data availability for a particular window has gaps, but not enough data is missing to prevent the consumers of that data from behaving as expected.

Warning (warning)

Warning level events are the first event class to require a corrective action, and should correspond to application events that negatively impact the output of the application or its ability to perform its required duties, without failing in a way that causes lasting impacts. Generally, warning-level conditions can be recovered from automatically and do not result in harmful output.
These events should be sufficiently rare as they will generally result in a notice being escalated through the team that is on-call and will prompt an immediate response during business hours, or a next-business-day response out-of-hours.
The impact and correction information should be sufficiently well referenced that the on-call team can triage the event quickly and, if necessary, engage the right personnel to potentially prevent any escalation of the error condition.
Examples


All but the final attempt to contact a remote server for data failed, and if the final attempt fails, an error event will be raised.
Processing or transfer rate has crossed a threshold where it is now preventing the process from keeping up with its input, however there is sufficient buffering in place to handle some pre-set estimated duration (such as 6 hours).
A process failed to start on a particular node in a cluster, however the process was successfully started on another node.
There is missing data in a database sufficient to cause reduced accuracy or reliability of analysis performed on it, however not enough to cause the methods to fail entirely (assuming reduced accuracy is not a failure condition).
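
The rate-threshold examples under Notice and Warning differ in whether the process can still keep up with its input. One way to sketch that distinction is shown here. The function names, the `slow_rate` parameter, and the units (items per second) are all assumptions for illustration, not something the levels themselves prescribe.

```python
def throughput_level(input_rate, processing_rate, slow_rate):
    """Return "warning", "notice", or None for the current rates (items/s)."""
    if processing_rate < input_rate:
        return "warning"  # falling behind the input; buffering absorbs the backlog
    if processing_rate < slow_rate:
        return "notice"   # keeping up, but tasks may run longer than intended
    return None


def buffer_hours(free_items, input_rate, processing_rate):
    """Hours until the buffer fills at the current rates (inf if keeping up)."""
    backlog_rate = input_rate - processing_rate
    if backlog_rate <= 0:
        return float("inf")
    return free_items / backlog_rate / 3600.0
```

With 432000 free buffer slots, an input rate of 100 items/s, and a processing rate of 80 items/s, `buffer_hours` gives 6.0, which matches the "such as 6 hours" duration mentioned in the warning example.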

Error (err)

Error events arise when application conditions enter an unrecoverable state that results in no, incorrect, or unreliable output or operation of a system. Error level conditions are not sufficient to cause the application to crash entirely, and still represent situations that were anticipated by the programmer and have code paths that handle them.
Error level events will result in a notification being escalated through on-call staff responsible for the application for immediate triage and potential interim fix.
Examples


All attempts to contact a remote server failed, and the remote data will not be available again.
Input data does not conform to the required format or specification and so methods using that data cannot run
All workers in a cluster are unavailable, and the job being submitted is time-sensitive.
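
The remote-server examples under Notice, Warning, and Error describe one failure escalating as retries run out. A minimal sketch of that progression follows, assuming a fixed retry budget; the function name and its parameters are hypothetical.

```python
def attempt_failure_level(attempt, max_attempts):
    """Level for a failed connection attempt (attempt is 1-based)."""
    if attempt >= max_attempts:
        return "err"      # all attempts failed
    if attempt == max_attempts - 1:
        return "warning"  # all but the final attempt failed
    return "notice"       # will be retried more than once
```

With a budget of four attempts, successive failures would be logged as notice, notice, warning, and err.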

Critical (crit)

A critical event occurs when the program crashes or otherwise terminates in an unclean or unanticipated way; such events are usually unexpected in nature. Critical messages may be preceded, or accompanied, by other messages, potentially of other levels, from the steps that led up to the critical event. Other applications that do not depend on the failed application are not impacted by one application failing in a critical way.
Critical events will alert ALL on-call staff related to the application for immediate resolution.
Examples


Data corruption/errors in data that exceed sane levels that cause runtime errors
Unexpected process termination, perhaps due to other misbehaving applications.

Alert (alert)

Alert level events are a pre-emptive indicator that the application has detected that its behaviour is in a state that may soon compromise the integrity of the system as a whole. Additionally, alert events will be generated by the system when it detects that applications are misbehaving in ways that may have begun to impact other applications or processes.
The system is not (yet) unstable or compromised at the point of an alert event.
An alert event will be broadcast to ALL on-call staff for the application as well as the system the application may be impacting.
Examples


Input events to the application are at a rate where it is likely that disk, network, or CPU saturation is occurring, impacting other applications.
Data exceeds sane levels, and is resulting in significantly more memory being allocated

Emergency (emerg)

Emergency events are sent by the system only, and are sent when the system is entering, or has entered, an unstable, unpredictable, or insolvent state.
Emergency events are sent to ALL on-call staff for the system in question, and are also escalated through the teams that have applications on that system (which presumably have received a notice from their application) or that depend on that system or its applications.
Examples


Out of memory
Kernel panic
Hardware failure
AWS failure
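
The escalation rules described in each section can be summarised as a lookup table. This is a sketch under assumptions: `ESCALATION`, `SEVERITY`, and `needs_escalation` are hypothetical names, the recipient strings paraphrase this document, and levels for which no recipient is stated are mapped to `None`. The numeric values are the standard syslog severities, where 0 is the most severe.

```python
# Who is notified at each level; None where the document names no recipient.
ESCALATION = {
    "debug": None,
    "info": None,
    "notice": None,
    "warning": "on-call team (immediately in business hours, else next business day)",
    "err": "on-call staff responsible for the application",
    "crit": "ALL on-call staff for the application",
    "alert": "ALL on-call staff for the application and the impacted system",
    "emerg": "ALL on-call staff for the system, plus teams with applications on it",
}

# Standard syslog numeric severities (0 is most severe).
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}


def needs_escalation(level):
    """True if an event at this level notifies anyone."""
    return ESCALATION[level] is not None
```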