xmfcx/Autoware Diagnostics and Monitoring.md Secret

## Autoware Diagnostics and Monitoring.md

      
Autoware Diagnostics and Monitoring

Related MR on Error Monitor Design
Initial implementation of the monitored node API

In Autoware, it should be possible to:

Monitor the rates of certain publishers
Monitor the states of the nodes
Visualize all this information from a single place with a GUI

Also, my questions on the MR:

Do we really need a monitored subscriber?
Does the monitored node need a timer at all?

Do we really need a monitored subscriber?

I think the subscriber of each node shouldn't worry about its input frequencies;
it would be too much of a hassle to set this for each specific node as a launch parameter.
Instead, if there were a centralized way of entering the expected minimum and maximum latencies of specific nodes,
it would look more organized, and different configurations could be orchestrated and managed more easily.

Monitoring the rates of certain publishers

I like the idea of having monitored publishers.
In your implementation, the monitored publisher publishes the following in a latched way:

<topic_name>.min_publish_interval_ms
<topic_name>.max_publish_interval_ms
<topic_name>.max_callback_duration_ms (I didn't get this one)

I think the monitored publisher should do this instead:

The constructor of the publisher takes min_publish_interval_ms and max_publish_interval_ms as parameters.
The constructor of the publisher creates another normal publisher named <topic_name>.tick_diagnostic.
<topic_name>.tick_diagnostic carries only a header.
Every time the publish(..) method is called, <topic_name>.tick_diagnostic is published along with the intended message.
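
As a rough sketch of the steps above, the wrapper could look something like this. It is deliberately ROS-free so it compiles on its own: `MonitoredPublisher`, `TickDiagnostic`, the two `std::function` sinks (standing in for the real publishers) and the injected `now_ns` clock are all hypothetical names, not part of the MR or of rclcpp.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical header-only tick message: it carries a timestamp and nothing else.
struct TickDiagnostic
{
  std::int64_t stamp_ns;
};

// Hypothetical wrapper. The two std::function sinks stand in for the real
// publishers so that the sketch compiles without ROS 2.
template<typename MessageT>
class MonitoredPublisher
{
public:
  MonitoredPublisher(
    std::string topic_name,
    std::chrono::milliseconds min_publish_interval,
    std::chrono::milliseconds max_publish_interval,
    std::function<void(const MessageT &)> message_sink,
    std::function<void(const TickDiagnostic &)> tick_sink,
    std::function<std::int64_t()> now_ns)
  : topic_name_(std::move(topic_name)),
    tick_topic_name_(topic_name_ + ".tick_diagnostic"),
    min_publish_interval_(min_publish_interval),
    max_publish_interval_(max_publish_interval),
    message_sink_(std::move(message_sink)),
    tick_sink_(std::move(tick_sink)),
    now_ns_(std::move(now_ns))
  {
  }

  // Publish the intended message, then the header-only tick next to it.
  void publish(const MessageT & message)
  {
    message_sink_(message);
    tick_sink_(TickDiagnostic{now_ns_()});
  }

  const std::string & topic_name() const {return topic_name_;}
  const std::string & tick_topic_name() const {return tick_topic_name_;}
  std::chrono::milliseconds min_publish_interval() const {return min_publish_interval_;}
  std::chrono::milliseconds max_publish_interval() const {return max_publish_interval_;}

private:
  std::string topic_name_;
  std::string tick_topic_name_;
  std::chrono::milliseconds min_publish_interval_;
  std::chrono::milliseconds max_publish_interval_;
  std::function<void(const MessageT &)> message_sink_;
  std::function<void(const TickDiagnostic &)> tick_sink_;
  std::function<std::int64_t()> now_ns_;
};
```

In a real node the sinks would presumably forward to two ordinary publishers, one on <topic_name> and one on <topic_name>.tick_diagnostic, and the clock would come from the node.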

Since the message is so light, it shouldn't add much network traffic.
The central state monitor could subscribe to these *.tick_diagnostic topics
(taken from an internally managed list set in its params.yaml)
and check whether all of these publishers are publishing at their intended rates.
If not, it could trigger the emergency handling actions.
We would then be able to monitor each publisher's rate and even visualize it with a GUI if we wanted to.
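
The rate check in the central monitor could be sketched roughly as follows. This is only a model of the bookkeeping, under assumptions: `TickRateMonitor` and `RateLimits` are hypothetical names, the limits map stands in for whatever is loaded from params.yaml, and timestamps are plain milliseconds. A topic is reported when it has been silent for longer than its maximum interval, or when the gap between its last two ticks falls outside the allowed range.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Expected publish interval bounds for one topic (hypothetical layout).
struct RateLimits
{
  std::int64_t min_interval_ms;
  std::int64_t max_interval_ms;
};

class TickRateMonitor
{
public:
  // `limits` plays the role of the list loaded from params.yaml.
  TickRateMonitor(std::map<std::string, RateLimits> limits, std::int64_t start_ms)
  : limits_(std::move(limits))
  {
    for (const auto & entry : limits_) {
      state_[entry.first] = TopicState{start_ms, 0, 0};
    }
  }

  // Record a tick; ticks on topics that are not in the list are ignored.
  void on_tick(const std::string & topic, std::int64_t stamp_ms)
  {
    const auto it = state_.find(topic);
    if (it == state_.end()) {
      return;
    }
    TopicState & s = it->second;
    if (s.tick_count > 0) {
      s.last_interval_ms = stamp_ms - s.last_stamp_ms;
    }
    s.last_stamp_ms = stamp_ms;
    ++s.tick_count;
  }

  // Topics that currently violate their limits, sorted by topic name.
  std::vector<std::string> violations(std::int64_t now_ms) const
  {
    std::vector<std::string> result;
    for (const auto & entry : limits_) {
      const RateLimits & lim = entry.second;
      const TopicState & s = state_.at(entry.first);
      const bool silent = (now_ms - s.last_stamp_ms) > lim.max_interval_ms;
      const bool bad_interval = s.tick_count >= 2 &&
        (s.last_interval_ms < lim.min_interval_ms ||
        s.last_interval_ms > lim.max_interval_ms);
      if (silent || bad_interval) {
        result.push_back(entry.first);
      }
    }
    return result;
  }

private:
  struct TopicState
  {
    std::int64_t last_stamp_ms;     // last tick, or the start time before any tick
    std::int64_t tick_count;        // number of ticks seen so far
    std::int64_t last_interval_ms;  // meaningful once tick_count >= 2
  };

  std::map<std::string, RateLimits> limits_;
  std::map<std::string, TopicState> state_;
};
```

A non-empty result from `violations()` is where the emergency handling would be triggered, and the same per-topic state could feed a GUI.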

Monitoring the states of the nodes

http://design.ros2.org/articles/node_lifecycle.html

First of all, all nodes in Autoware should inherit from rclcpp_lifecycle::LifecycleNode
and have the following primary states:

Unconfigured
Inactive
Active
Finalized

and the following intermediate (transition) states:

Configuring
CleaningUp
ShuttingDown
Activating
Deactivating
ErrorProcessing
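
To make the relationship between the two lists concrete, here is a small standalone model of the state machine. The state names are the ones listed above; the transition names and their targets follow my reading of the node lifecycle design article, and only the success path of each intermediate state is modelled. The enum and function names are hypothetical and are not the rclcpp_lifecycle API.

```cpp
#include <cassert>

enum class LifecycleState
{
  // Primary states
  Unconfigured, Inactive, Active, Finalized,
  // Intermediate (transition) states
  Configuring, CleaningUp, ShuttingDown, Activating, Deactivating, ErrorProcessing
};

enum class Transition {Configure, Cleanup, Activate, Deactivate, Shutdown};

// True for the four states a node can rest in.
bool is_primary(LifecycleState s)
{
  switch (s) {
    case LifecycleState::Unconfigured:
    case LifecycleState::Inactive:
    case LifecycleState::Active:
    case LifecycleState::Finalized:
      return true;
    default:
      return false;
  }
}

// If `requested` is allowed from primary state `from`, store the intermediate
// state that is entered in `next` and return true; otherwise return false.
bool begin_transition(LifecycleState from, Transition requested, LifecycleState & next)
{
  switch (requested) {
    case Transition::Configure:
      if (from != LifecycleState::Unconfigured) {return false;}
      next = LifecycleState::Configuring;
      return true;
    case Transition::Cleanup:
      if (from != LifecycleState::Inactive) {return false;}
      next = LifecycleState::CleaningUp;
      return true;
    case Transition::Activate:
      if (from != LifecycleState::Inactive) {return false;}
      next = LifecycleState::Activating;
      return true;
    case Transition::Deactivate:
      if (from != LifecycleState::Active) {return false;}
      next = LifecycleState::Deactivating;
      return true;
    case Transition::Shutdown:
      if (from != LifecycleState::Unconfigured && from != LifecycleState::Inactive &&
        from != LifecycleState::Active)
      {
        return false;
      }
      next = LifecycleState::ShuttingDown;
      return true;
  }
  return false;
}

// Primary state reached when an intermediate state completes successfully.
// Primary states are returned unchanged.
LifecycleState on_success(LifecycleState s)
{
  switch (s) {
    case LifecycleState::Configuring: return LifecycleState::Inactive;
    case LifecycleState::CleaningUp: return LifecycleState::Unconfigured;
    case LifecycleState::Activating: return LifecycleState::Active;
    case LifecycleState::Deactivating: return LifecycleState::Inactive;
    case LifecycleState::ShuttingDown: return LifecycleState::Finalized;
    case LifecycleState::ErrorProcessing: return LifecycleState::Unconfigured;
    default: return s;
  }
}
```

Failure paths (an intermediate state failing and dropping into ErrorProcessing, or ErrorProcessing itself failing) are left out to keep the sketch short.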

At the end of that document, it says:

Extensions
This lifecycle will be required to be supported throughout the toolchain as such this design is not intended to be extended with additional states.
It is expected that there will be more complicated application specific state machines.
They may exist inside of any lifecycle state or at the macro level these lifecycle states are expected to be useful primitives as part of a supervisory system.

That means we should manage any further states separately from this state machine.
I thought a lot about adding more states, but I think the existing ones should be enough.
I'd suggest we use register_on_error to report the error severity to the Autoware State Manager, publishing a custom message such as autoware_auto_system_msgs/msg/HazardStatus.idl on a topic like <node_name>.error_diagnostic.

Autoware State Manager

This manager will subscribe to lifecycle_msgs::msg::TransitionEvent messages, like in here, to be notified of state changes.
It will also call the <node_name>__get_state service once at startup to learn each node's initial state (explained in detail in https://index.ros.org/p/lifecycle/).
This node will also subscribe to the <node_name>.error_diagnostic topics defined in its params.yaml file.
For all the nodes, it will perform emergency handling actions accordingly and visualize the states and/or errors of these nodes.
In the configuration file of this node, we'd specify which nodes are required to run and which are optional.
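
The required/optional split could be modelled roughly like this. It is a sketch under assumptions: `StateManagerModel` is a hypothetical name, the policy "any required node that is not Active calls for emergency handling" is only one possible rule (the text above leaves the exact policy open), and the `on_state` calls stand in for the get_state response and the TransitionEvent callbacks. Error severities from <node_name>.error_diagnostic are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Primary lifecycle states, plus Unknown for nodes not heard from yet.
enum class NodeState {Unknown, Unconfigured, Inactive, Active, Finalized};

class StateManagerModel
{
public:
  // Register a node from the configuration file.
  void add_node(const std::string & name, bool required)
  {
    nodes_[name] = NodeEntry{required, NodeState::Unknown};
  }

  // Update a node's state, e.g. from the initial get_state call or from a
  // later TransitionEvent. Nodes missing from the configuration are ignored.
  void on_state(const std::string & name, NodeState state)
  {
    const auto it = nodes_.find(name);
    if (it != nodes_.end()) {
      it->second.state = state;
    }
  }

  // Required nodes that are not Active, sorted by name.
  std::vector<std::string> required_nodes_not_active() const
  {
    std::vector<std::string> result;
    for (const auto & entry : nodes_) {
      if (entry.second.required && entry.second.state != NodeState::Active) {
        result.push_back(entry.first);
      }
    }
    return result;
  }

  // Assumed policy: optional nodes never trigger emergency handling.
  bool needs_emergency_handling() const
  {
    return !required_nodes_not_active().empty();
  }

private:
  struct NodeEntry
  {
    bool required;
    NodeState state;
  };

  std::map<std::string, NodeEntry> nodes_;
};
```

Keeping the policy in a single function like `needs_emergency_handling()` would make it easy to swap in a stricter or more lenient rule per configuration.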