@whitmo
Last active August 29, 2015 14:05

Health Check

Services framework

{
 health: <one or more callbacks>
}

Health callbacks

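def callback(service_name):
    # Run one check for service_name and build its subset report (see "Report data").
    sub_report_data = {"name": "<health check name>", "health": "passed"}
    return sub_report_data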
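
As a rough illustration, a concrete callback might look like the sketch below. The check name `db-cxn` and the field names come from the subset report format in this document; `ping_database` is a hypothetical stand-in for a real probe, and the one-second warning threshold is an assumption.

```python
import time


def ping_database(service_name):
    """Hypothetical probe: stands in for a real round trip to the DB."""
    start = time.monotonic()
    # A real check would open a connection here; this sketch does nothing.
    return time.monotonic() - start


def db_cxn_callback(service_name, probe=ping_database):
    """Health callback: run one check and return its subset report."""
    try:
        elapsed = probe(service_name)
    except Exception as exc:  # any probe failure is reported, not raised
        return {
            "name": "db-cxn",
            "qos": 0.0,
            "health": "fail",
            "message": "cannot talk to the DB: %s" % exc,
            "data": {},
        }
    # Assumed threshold: a ping slower than one second is only a warning.
    health = "passed" if elapsed < 1.0 else "warn"
    return {
        "name": "db-cxn",
        "qos": None,  # nil: this sketch does not compute a QoS figure
        "health": health,
        "message": "we can talk to the DB",
        "data": {"ping": "%.3fs" % elapsed},
    }
```

Passing the probe in as an argument keeps the callback testable without a live database.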

Report data

Full report:

---
service: service name
qos: nil | <0.0-1.0>
health: [passed|warn|fail]
state:
  name: [building,installing,churning,blocked,error,up]
  duration: <time in state>
  blockers: <list of dependencies if state is blocked>
checks:
  - name: <health check name>
    health: [passed|warn|fail]
    message: <text description of check>
    data: <structured output data>

Subset report:

Information like the transient state of the service makes the most sense to persist in a file or log. Each check could contribute a subset to be merged into the full report:

---
name: db-cxn
qos: 0.99
health: passed
message: "we can talk to the DB"
data:
  ping: 0.002s
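
One way the merge could work, as a sketch: each subset becomes an entry under `checks`, and the overall `health` is the worst health among them. The worst-of rule and taking the minimum `qos` are assumptions, not something this document specifies, and `merge_reports` is a hypothetical name.

```python
# Severity order assumed for this sketch: higher is worse.
SEVERITY = {"passed": 0, "warn": 1, "fail": 2}


def merge_reports(service_name, state, subsets):
    """Fold per-check subset reports into one full report dict."""
    checks = []
    worst = "passed"  # with no checks at all, the sketch reports "passed"
    qos_values = []
    for sub in subsets:
        checks.append({
            "name": sub["name"],
            "health": sub["health"],
            "message": sub.get("message", ""),
            "data": sub.get("data", {}),
        })
        if SEVERITY[sub["health"]] > SEVERITY[worst]:
            worst = sub["health"]
        if sub.get("qos") is not None:
            qos_values.append(sub["qos"])
    return {
        "service": service_name,
        # Assumption: service QoS is the lowest QoS any check reported.
        "qos": min(qos_values) if qos_values else None,
        "health": worst,
        "state": state,
        "checks": checks,
    }
```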

Service data collation

Ultimately this will live in the state server. For the time being, we have several options.

  • Collate on the client, i.e. juju cfs

    This would be simple for the purposes of creating a quick and custom utility but would either divide service logic or dictate a fixed merging strategy.

  • Collate on the orchestrator

    This would more closely emulate what will ultimately live in core. This method could also facilitate custom merging rules, defined in the service generation blocks, to display service-specific views of unit health info.

More precise reporting makes sense for unit by unit inspection, but some sort of service rollup is necessary for general scanning.
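
A service rollup for general scanning could be as simple as the sketch below: count units per health value and surface the worst one for each service. The severity ordering and the `rollup` name are assumptions made for illustration.

```python
from collections import Counter

# Severity order assumed for this sketch: higher is worse.
SEVERITY = {"passed": 0, "warn": 1, "fail": 2}


def rollup(unit_reports):
    """Summarize per-unit full reports into one entry per service.

    unit_reports maps a unit name (e.g. "db/0") to that unit's full report.
    """
    summary = {}
    for report in unit_reports.values():
        entry = summary.setdefault(
            report["service"], {"health": "passed", "units": Counter()}
        )
        entry["units"][report["health"]] += 1
        if SEVERITY[report["health"]] > SEVERITY[entry["health"]]:
            entry["health"] = report["health"]
    return summary
```

The per-unit reports stay available for closer inspection; the rollup only decides what to show first.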

Architectures: Execution, transport and periodicity

We have a few different hammers here:

  • juju ssh from orchestrator (reconciler) to execute a script

  • reap a log from a unit local daemon (or cron) that periodically executes a script

  • network access to a unit local daemon allowing remote triggering of health check execution (or reaping of a log created by periodic execution).

  • a daemon on the unit periodically executes checks and pushes data back to the orchestrator.

juju run can be ruled out because it can be blocked by failed hooks, etc.

Proposed implementation

The most basic version (a summary of what's happening on a node) could consist of:

  • a service "health" hook
  • a series of reconciler ssh runs to collect JSON from the hook
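
The collection step above could be sketched roughly as follows. It assumes the health hook can be invoked over `juju ssh <unit> <command>` and prints one JSON document to stdout; the default `hooks/health` command is a hypothetical placeholder, and the `run` parameter exists only so the function can be exercised without a live environment.

```python
import json
import subprocess


def collect_health(units, command="hooks/health", run=subprocess.check_output):
    """Run `command` on each unit over `juju ssh` and parse the JSON it prints."""
    reports = {}
    for unit in units:
        try:
            raw = run(["juju", "ssh", unit, command])
            reports[unit] = json.loads(raw)
        except (subprocess.CalledProcessError, OSError, ValueError) as exc:
            # An unreachable unit or unparseable output counts as a failed report.
            reports[unit] = {"health": "fail", "message": str(exc)}
    return reports
```

Treating a collection failure as a `fail` report keeps every unit visible in the output, rather than silently dropping the ones that could not be reached.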

A more sophisticated version (closer to the idealized future):

  • a Go daemon on each unit which provides:
      • a command through which hooks can report "status" (similar to proposal)
      • periodic execution of health checks and structured logging of results
      • result shipping to the reconciler
      • ad hoc execution of a check by query
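
To make the daemon's core loop concrete, here is a minimal sketch of periodic execution, structured logging, and shipping. It is written in Python to match the callback example in this document, although the proposal is for a Go daemon; the function names, the JSON-lines log format, the 60 second interval, and the `ship` callable are all assumptions.

```python
import json
import time


def run_checks(service_name, checks, log_file, ship=None, now=time.time):
    """Run every check once, log one JSON line per result, optionally ship."""
    results = []
    for check in checks:
        result = dict(check(service_name))
        result["timestamp"] = now()
        log_file.write(json.dumps(result, sort_keys=True) + "\n")
        results.append(result)
    if ship is not None:
        ship(results)  # transport to the reconciler is deliberately left open
    return results


def run_forever(service_name, checks, log_file, interval=60.0, ship=None):
    """Periodic execution: repeat run_checks every `interval` seconds."""
    while True:
        run_checks(service_name, checks, log_file, ship)
        log_file.flush()
        time.sleep(interval)
```

Ad hoc execution by query would amount to calling `run_checks` once, outside the loop, with only the requested check.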

The daemon would be set up and run as part of the OrchestratorRelationHook, which would pass along any information needed to connect to the reconciler.
