First, let's define various error levels (log levels).
Fatal/Critical: The application composition root should look something like this:
def main(*args)
  begin
    # compose and start
  rescue StandardError => err
    log.fatal(err)
    exit(1)
  end
end
Fatal should only ever be used to indicate an uncaught exception that crashes the application.
Error: This is a problem that should be investigated immediately. It could indicate:
- A situation which is fatal to the operation, but not the service or application (can't open a required file, missing data, etc.). These errors will force user (administrator, or direct user) intervention. These should be reserved for incorrect connection strings, missing services, etc.
- A scenario that is correctable, but is not validated at the proper level/depth, and therefore it may not be immediately clear what the resolution is.
As an example:
Data too long for column 'foo' at row 1: INSERT INTO 'bar'
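To illustrate what validating at the proper depth might look like, here is a minimal sketch. The names `FOO_MAX_LENGTH`, `ValidationError`, and `validate_foo!` are hypothetical, and the limit of 10 is an assumed column width, not something taken from a real schema. The idea is to reject the value before it reaches the database, so the message says what to fix.

```ruby
# Hypothetical limit mirroring the assumed width of column 'foo' in table 'bar'.
FOO_MAX_LENGTH = 10

# Hypothetical error type for input that the user can correct.
class ValidationError < StandardError; end

# Check the value before it reaches the database so the message
# names the field, the limit, and the actual length.
def validate_foo!(value)
  return value if value.length <= FOO_MAX_LENGTH

  raise ValidationError,
        "foo must be at most #{FOO_MAX_LENGTH} characters (got #{value.length})"
end

message =
  begin
    validate_foo!("a value that is far too long")
    nil
  rescue ValidationError => err
    err.message
  end
```

With a check like this, the raw database error above should only show up when a validation is missing, which is itself worth investigating.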
Warn: This could be a problem, or it might not be. Use this in situations such as:
- Transient environment conditions, such as network/db connectivity. These issues should be escalated to error if they do not recover.
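One way the warn-then-escalate rule could be sketched is shown below. `TransientConnectionError` and `with_escalation` are hypothetical names, and the default of three attempts is an arbitrary assumption. Each transient failure is logged as a warning, and only a failure that never recovers is logged as an error.

```ruby
require "logger"
require "stringio"

# Hypothetical error type standing in for a transient connectivity failure.
class TransientConnectionError < StandardError; end

# Run the block up to max_attempts times. Each transient failure is
# logged as a warning; only the final, unrecovered failure is escalated
# to an error. Returns the block's value, or nil if it never recovered.
def with_escalation(log, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue TransientConnectionError => err
    if attempts < max_attempts
      log.warn("attempt #{attempts} failed (#{err.message}), retrying")
      retry
    end
    log.error("giving up after #{attempts} attempts: #{err.message}")
    nil
  end
end

# Usage: the connection recovers on the third call, so only warnings are logged.
output = StringIO.new
log = Logger.new(output)

calls = 0
result = with_escalation(log) do
  calls += 1
  raise TransientConnectionError, "db unreachable" if calls < 3
  :connected
end
```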
Info: Generally useful information to log (service start/stop, configuration assumptions, etc.). This is information I always want available but usually don't care about under normal circumstances. It gives insight into the coarse-grained progress of an operation.
Trace: Fine-grain progress of an operation, including context. Typically this would not be enabled in production.
Debug: Debug < Trace. Debug messages are not encouraged; instead, use useful trace messages. These should never be enabled in production.
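As a rough illustration of how a level threshold behaves, here is a small sketch using Ruby's standard `Logger`. It assumes a production threshold of info. Note that the standard `Logger` has no separate trace level (its severities are debug, info, warn, error, fatal, and unknown), so debug stands in for the fine-grained messages here.

```ruby
require "logger"
require "stringio"

# Capture output in memory so the effect of the threshold is visible.
output = StringIO.new
logger = Logger.new(output)

# Assumed production setting: emit info and anything more severe.
logger.level = Logger::INFO

logger.debug("cache key computed")          # below the threshold, suppressed
logger.info("service started")              # emitted
logger.warn("db connection lost, retrying") # emitted
logger.error("required file is missing")    # emitted

emitted = output.string
```

Raising or lowering `logger.level` is the only change needed to turn the fine-grained messages on for a debugging session.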
The top level of an operation should contain a begin/rescue block to log when the operation exits fatally and cannot recover. In most circumstances, it is preferable not to crash the application if the operation fails. We'll want to wrap individual operations with specific rescues in order to provide the proper exit code to the user.
Example:
def main(*args)
  begin
    # note on exit codes in this example:
    # 0: success
    # 1: failure due to external circumstances (something that can be fixed by the operator):
    #   * configuration
    #   * network connection
    #   * external dependent service problem
    # 2: failure due to bug in code

    # if the config file is missing, or malformed, we cannot continue
    config, err = try_initialize_config
    if err
      log.error(err)
      exit(1)
    end

    # the above code could also be written like this
    # if initialize_config could raise an error instead of returning success/failure
    begin
      config = initialize_config
    rescue InitializationFailedError => err
      log.error(err)
      exit(1)
    end

    # this is not wrapped in a begin/rescue that calls log.error because
    # if something happens during initialization of the class,
    # it indicates a code bug, not something the operator can fix
    thing_doer = ThingDoer.new(config.stuff)

    begin
      thing_doer.do_thing()
    rescue ThingDoerError => err
      log.error(err)
      exit(1)
    end

    exit(0)
  rescue StandardError => err
    log.fatal(err)
    # set an exit code that means something disastrous happened
    # this is different than the operation failed.
    # this should almost _always_ mean there's a bug in the code.
    exit(2)
  end
end
You should only ever catch an exception inside an operation if you plan on doing something about it.
If you plan on reraising, don't log the error; it will be logged where it's appropriate upstream.
def publish_to_api(data={})
  tries ||= 3
  log.info "attempting to publish to api"
  DataLibrary.publish(data)
rescue DataLibraryFailureException => e
  if (tries -= 1).zero?
    # we'll reraise here because we don't know what else we can do
    # unless upstream has another method for handling this situation,
    # we'll end up logging this error at the _operation_ level,
    # so we don't need to explicitly log here
    raise
  else
    log.warn "publish to api failed, will attempt retry"
    retry
  end
else
  log.info "success!"
end
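The upstream half of this rule might look something like the sketch below: an operation-level wrapper that logs a reraised error exactly once. `run_operation`, `PublishError`, and `publish` are hypothetical names used only for illustration, and returning true/false is one possible convention rather than a requirement.

```ruby
require "logger"
require "stringio"

# Hypothetical error type raised by the inner step.
class PublishError < StandardError; end

# Inner step: it cannot do anything useful about the failure,
# so it raises without logging.
def publish(data)
  raise PublishError, "api rejected #{data.size} records"
end

# Operation level: the single place where the failure is logged.
# Returns true on success and false on failure.
def run_operation(log, name)
  yield
  true
rescue StandardError => err
  log.error("#{name} failed: #{err.message}")
  false
end

output = StringIO.new
log = Logger.new(output)
succeeded = run_operation(log, "publish") { publish([1, 2, 3]) }
```

Because only the wrapper logs, the failure appears once in the output instead of once per layer it passes through.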