whitmo/cf-health-check.md

## cf-health-check.md

      
    Health Check

Services framework

{
 health: <one or more callbacks>
}

Health callbacks

def callback(service_name):
    return sub_report_data
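
A minimal sketch of one such callback, assuming the sub-report fields shown in the "Subset report" example in this document. The check name `db-cxn` comes from that example; `ping_db`, the `probe` parameter, and the 0.5 second warning threshold are hypothetical stand-ins, not part of any existing framework.

```python
import time


def ping_db():
    """Hypothetical stand-in for a real connectivity probe."""
    return True


def db_cxn_check(service_name, probe=ping_db, warn_after=0.5):
    """Health callback: takes the service name, returns a sub-report dict."""
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception as exc:  # a probe that raises counts as a failed check
        return {
            "name": "db-cxn",
            "health": "fail",
            "message": "%s: probe raised %r" % (service_name, exc),
            "data": {},
        }
    elapsed = time.monotonic() - start

    if not ok:
        health, text = "fail", "DB unreachable"
    elif elapsed > warn_after:  # assumed threshold, tune per service
        health, text = "warn", "DB reachable but slow"
    else:
        health, text = "passed", "we can talk to the DB"

    return {
        "name": "db-cxn",
        "health": health,
        "message": "%s: %s" % (service_name, text),
        "data": {"ping": "%.3fs" % elapsed},
    }
```

Passing the probe in as a parameter keeps the callback testable without a live database.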


Report data

Full report:
---
service: service name
qos: nil | <0.0-1.0>
health: [passed|warn|fail]
state:
  name: [building,installing,churning,blocked,error,up]
  duration: <time in state>
  blockers: <list of dependencies if state is blocked>
checks:
  - name: <health check name>
    health: [passed|warn|fail]
    message: <text description of check>
    data: <structured output data>

Subset report:
Information like the transient state of the service makes the most sense to persist in a file or log. Each check could contribute a subset to be merged into the full report:
---
name: db-cxn
qos: 0.99
health: passed
message: "we can talk to the DB"
data:
  ping: 0.002s
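
The merge step could look roughly like the sketch below. It assumes reports are plain dicts with the fields listed above, that overall health is the worst individual result, and that overall `qos` is the lowest value any check reported. Both rollup rules are assumptions for illustration; the document does not fix a merging strategy.

```python
# Ordering used to roll up health: the worst individual result wins.
SEVERITY = {"passed": 0, "warn": 1, "fail": 2}


def merge_report(service, state, subsets):
    """Fold per-check sub-reports into one full report dict."""
    checks = []
    qos_values = []
    worst = "passed"
    for sub in subsets:
        checks.append({
            "name": sub["name"],
            "health": sub["health"],
            "message": sub.get("message", ""),
            "data": sub.get("data", {}),
        })
        if SEVERITY[sub["health"]] > SEVERITY[worst]:
            worst = sub["health"]
        if sub.get("qos") is not None:
            qos_values.append(sub["qos"])
    return {
        "service": service,
        # Assumption: overall qos is the lowest reported value, else None (nil).
        "qos": min(qos_values) if qos_values else None,
        "health": worst,
        "state": state,
        "checks": checks,
    }
```

The `state` block is passed in separately because, as noted above, transient state is better read from a file or log than produced by a check.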

Service data collation

Ultimately this will live in the state server. We have several options for
the time being.


Collate on the client, i.e. juju cfs
This would be simple for the purposes of creating a quick and
custom utility but would either divide service logic or dictate a
fixed merging strategy.


Collate on the orchestrator
This would more closely emulate what will ultimately live in
core. This method could also facilitate custom merging rules that
could be defined in the service generation blocks to display
service specific views of unit health info.


More precise reporting makes sense for unit by unit inspection, but
some sort of service rollup is necessary for general scanning.
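
A service rollup with a swappable merging rule might be sketched as follows. It assumes the orchestrator holds a dict mapping unit names to per-unit reports. The names `rollup`, `worst_of`, and `majority` are hypothetical, and `majority` is only an example of a service-specific rule.

```python
from collections import Counter

SEVERITY = {"passed": 0, "warn": 1, "fail": 2}


def worst_of(unit_reports):
    """Default merge rule: the service is as healthy as its worst unit."""
    return max((r["health"] for r in unit_reports.values()),
               key=SEVERITY.__getitem__, default="passed")


def majority(unit_reports):
    """Hypothetical service-specific rule: fail only if most units fail."""
    failed = sum(1 for r in unit_reports.values() if r["health"] == "fail")
    if failed * 2 > len(unit_reports):
        return "fail"
    return "warn" if failed else worst_of(unit_reports)


def rollup(service, unit_reports, merge_rule=worst_of):
    """Summarize per-unit reports into a compact service view."""
    counts = Counter(r["health"] for r in unit_reports.values())
    return {
        "service": service,
        "health": merge_rule(unit_reports),
        "units": {level: counts.get(level, 0) for level in SEVERITY},
    }
```

A clustered service that tolerates the loss of a single unit could pass `merge_rule=majority`, while a single-unit service would keep the default.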

Architectures: Execution, transport and periodicity

We have a few different hammers here:


juju ssh from orchestrator (reconciler) to execute a script


reap a log from a unit local daemon (or cron) that periodically
executes a script


network access to a unit local daemon allowing remote triggering of
health check execution (or reaping of a log created by periodic
execution).


a daemon on the unit periodically executes checks and pushes data
back to the orchestrator.


juju run can be ruled out because it can be blocked by
failed hooks, etc.

Proposed implementation

Most basic function (summary of what's happening on a node) could consist of:

service "health" hook
series of reconciler ssh runs to collect json from hook
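
The two pieces above could be wired together roughly as in this sketch. It assumes the health hook prints a single JSON report on stdout, and that the reconciler can invoke `juju ssh <unit> <command>`. The hook location `hooks/health` is a hypothetical placeholder, and the `runner` parameter exists only so the transport can be swapped or faked.

```python
import json
import subprocess


def run_juju_ssh(unit, command):
    """Run a command on a unit via `juju ssh` and return its stdout."""
    return subprocess.check_output(["juju", "ssh", unit, command])


def collect_health(units, hook_path="hooks/health", runner=run_juju_ssh):
    """Call the health hook on each unit and parse the JSON it prints."""
    reports = {}
    for unit in units:
        try:
            reports[unit] = json.loads(runner(unit, hook_path))
        except (subprocess.CalledProcessError, OSError, ValueError) as exc:
            # An unreachable unit or unparseable output is itself a failure.
            reports[unit] = {
                "health": "fail",
                "message": str(exc),
                "checks": [],
            }
    return reports
```

Treating a transport error as a failed report keeps one unreachable unit from aborting the whole collection run.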

More sophisticated (more similar to idealized future)

Go daemon on each unit which provides:


a command hooks can use to report "status" (similar to proposal)
periodic execution of health checks and structured logging of results
result shipping to reconciler
ad hoc execution of a check by query

The daemon would be set up and run as part of the OrchestratorRelationHook, which would pass any needed information about connecting to the reconciler.
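
The periodic execution and structured logging duties could be sketched as below. The proposal calls for a Go daemon; Python is used here only to match the callback example earlier in this document. Result shipping and ad hoc queries are left out. The JSON-lines log format and the names `run_checks` and `run_periodically` are assumptions.

```python
import json
import time


def run_checks(service_name, checks, log, clock=time.time):
    """Run every check once and append one JSON line per result to log."""
    results = []
    for check in checks:
        try:
            result = check(service_name)
        except Exception as exc:  # a crashing check is recorded, not fatal
            result = {
                "name": getattr(check, "__name__", "unknown"),
                "health": "fail",
                "message": repr(exc),
            }
        record = dict(result, service=service_name, timestamp=clock())
        log.write(json.dumps(record, sort_keys=True) + "\n")
        results.append(record)
    log.flush()
    return results


def run_periodically(service_name, checks, log, interval=60.0,
                     iterations=None, sleep=time.sleep):
    """Run the checks every `interval` seconds; None means loop forever."""
    count = 0
    while iterations is None or count < iterations:
        run_checks(service_name, checks, log)
        count += 1
        if iterations is None or count < iterations:
            sleep(interval)
```

One record per line keeps the log easy to reap from the orchestrator side, which also covers the log-reaping option listed under architectures.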