Related MR on Error Monitor Design
Initial implementation of the monitored node API
In Autoware, it should be possible to:
- Monitor the rates of certain publishers
- Monitor the states of the nodes
- Visualize all this information from a single place with a GUI
Also my questions on the MR:
- Do we really need a monitored subscriber?
- Does monitored node need a timer at all?
I think each node's subscribers shouldn't have to worry about their input frequencies. It would be too much of a hassle to set these for each specific node as launch parameters.
Instead, if there were a centralized way of entering the expected minimum and maximum latencies of specific nodes, it would be more organized, and different configurations could be orchestrated and managed more easily.
I like the idea of having monitored publishers.
In your implementation, the monitored publisher publishes the following in a latched way:
- <topic_name>.min_publish_interval_ms
- <topic_name>.max_publish_interval_ms
- <topic_name>.max_callback_duration_ms (I didn't get this one)
I think the monitored publisher should do this instead:
- The constructor of the publisher takes in min_publish_interval_ms and max_publish_interval_ms as parameters.
- The constructor of the publisher creates another normal publisher named <topic_name>.tick_diagnostic.
- .tick_diagnostic has a header only.
- Every time the publish(..) method is called, <topic_name>.tick_diagnostic is published along with the intended message.
Since the message will be so light, it shouldn't be much of an issue on the network traffic side.
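The proposal above could be sketched roughly as follows. This is a transport-agnostic sketch, not the MR's API: TickDiagnostic, MonitoredPublisher and the two sink callbacks are hypothetical names, and the sinks stand in for the real publishers (in a node they would forward to the actual ROS publishers).

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>

// Hypothetical header-only tick message: source topic plus a timestamp.
struct TickDiagnostic {
  std::string topic;
  std::chrono::steady_clock::time_point stamp;
};

// Sketch of the proposed monitored publisher for payload type MsgT.
template <typename MsgT>
class MonitoredPublisher {
public:
  using MsgSink = std::function<void(const MsgT &)>;
  using TickSink = std::function<void(const TickDiagnostic &)>;

  MonitoredPublisher(
    const std::string & topic_name,
    std::chrono::milliseconds min_publish_interval,
    std::chrono::milliseconds max_publish_interval,
    MsgSink msg_sink, TickSink tick_sink)
  : tick_topic_(topic_name + ".tick_diagnostic"),
    min_publish_interval_(min_publish_interval),
    max_publish_interval_(max_publish_interval),
    msg_sink_(std::move(msg_sink)),
    tick_sink_(std::move(tick_sink)) {}

  // Publishes the intended message, then the lightweight tick next to it.
  void publish(const MsgT & msg) {
    msg_sink_(msg);
    tick_sink_(TickDiagnostic{tick_topic_, std::chrono::steady_clock::now()});
  }

  const std::string & tick_topic() const { return tick_topic_; }
  std::chrono::milliseconds min_publish_interval() const { return min_publish_interval_; }
  std::chrono::milliseconds max_publish_interval() const { return max_publish_interval_; }

private:
  std::string tick_topic_;
  std::chrono::milliseconds min_publish_interval_;
  std::chrono::milliseconds max_publish_interval_;
  MsgSink msg_sink_;
  TickSink tick_sink_;
};
```

The publisher itself does no rate checking here. It only stores the expected intervals and emits a tick on every publish call, which keeps the per-node cost small.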
And the central state monitor could subscribe to these *.tick_diagnostic messages (from its internally managed list, which is set from its params.yaml).
And it could check whether all these publishers are publishing at their intended rates.
If not, it could trigger the emergency handling actions.
And we would be able to monitor each publisher's rate and even visualize them with a GUI if we wanted to.
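As a rough illustration of the check the central monitor would run, here is a minimal sketch in plain C++. TickRateMonitor, ExpectedRate and the method names are hypothetical. It assumes the expected intervals have already been loaded from the monitor's parameters and that each received tick is fed in with its timestamp.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Expected publish interval bounds for one monitored tick topic.
struct ExpectedRate {
  Ms min_interval;
  Ms max_interval;
};

class TickRateMonitor {
public:
  TickRateMonitor(const std::map<std::string, ExpectedRate> & expected, Clock::time_point start) {
    for (const auto & kv : expected) {
      entries_[kv.first] = Entry{kv.second, start, false};
    }
  }

  // Records a tick. Returns false when the interval since the previous tick
  // on this topic is outside [min_interval, max_interval].
  // Unmonitored topics and the first tick on a topic are accepted.
  bool on_tick(const std::string & topic, Clock::time_point stamp) {
    auto it = entries_.find(topic);
    if (it == entries_.end()) { return true; }
    Entry & e = it->second;
    bool ok = true;
    if (e.seen) {
      const auto interval = stamp - e.last;
      ok = interval >= e.rate.min_interval && interval <= e.rate.max_interval;
    }
    e.last = stamp;
    e.seen = true;
    return ok;
  }

  // Topics that have been silent for longer than max_interval at time `now`.
  std::vector<std::string> overdue(Clock::time_point now) const {
    std::vector<std::string> result;
    for (const auto & kv : entries_) {
      if (now - kv.second.last > kv.second.rate.max_interval) {
        result.push_back(kv.first);
      }
    }
    return result;
  }

private:
  struct Entry {
    ExpectedRate rate;
    Clock::time_point last;  // last tick, or monitor start if none seen yet
    bool seen;
  };
  std::map<std::string, Entry> entries_;
};
```

A false return from on_tick, or a non-empty overdue list from a periodic check, is where the monitor could trigger the emergency handling actions. The overdue check is needed because a publisher that stops entirely never produces a tick to evaluate.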
http://design.ros2.org/articles/node_lifecycle.html
First of all, all nodes in Autoware should inherit from rclcpp_lifecycle::LifecycleNode and have the following primary states:
- Unconfigured
- Inactive
- Active
- Finalized
and the following intermediate states:
- Configuring
- CleaningUp
- ShuttingDown
- Activating
- Deactivating
- ErrorProcessing
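To make the relationship between the primary states concrete, here is a small model of the successful transitions described in the linked design article. It does not use rclcpp_lifecycle; the enum and function names are this sketch's own, and the intermediate states and error paths are left out.

```cpp
// Primary states from the managed-node design article.
enum class PrimaryState { Unconfigured, Inactive, Active, Finalized };

// Externally triggered transitions (successful outcomes only).
enum class Transition { Configure, CleanUp, Activate, Deactivate, Shutdown };

// Writes the resulting primary state to `to` and returns true when
// `transition` is valid from `from`; returns false otherwise.
bool next_primary_state(PrimaryState from, Transition transition, PrimaryState & to) {
  switch (transition) {
    case Transition::Configure:
      if (from != PrimaryState::Unconfigured) { return false; }
      to = PrimaryState::Inactive;
      return true;
    case Transition::CleanUp:
      if (from != PrimaryState::Inactive) { return false; }
      to = PrimaryState::Unconfigured;
      return true;
    case Transition::Activate:
      if (from != PrimaryState::Inactive) { return false; }
      to = PrimaryState::Active;
      return true;
    case Transition::Deactivate:
      if (from != PrimaryState::Active) { return false; }
      to = PrimaryState::Inactive;
      return true;
    case Transition::Shutdown:
      // Shutdown is allowed from every primary state except Finalized.
      if (from == PrimaryState::Finalized) { return false; }
      to = PrimaryState::Finalized;
      return true;
  }
  return false;
}
```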
At the end of that document, it says:
Extensions This lifecycle will be required to be supported throughout the toolchain as such this design is not intended to be extended with additional states. It is expected that there will be more complicated application specific state machines. They may exist inside of any lifecycle state or at the macro level these lifecycle states are expected to be useful primitives as part of a supervisory system.
That means we should manage further states separately from this state machine.
I thought a lot about adding some more states, but I think these should be enough.
I'd suggest we use register_on_error to publish the error severity to the Autoware State Manager on a topic like <node_name>.error_diagnostic, using some custom message like autoware_auto_system_msgs/msg/HazardStatus.idl.
This manager will subscribe to lifecycle_msgs::msg::TransitionEvent messages like in here and be notified of the state changes.
It will also call the <node_name>__get_state service once at startup to learn each node's initial state (explained in detail in https://index.ros.org/p/lifecycle/ ).
This node will also subscribe to the <node_name>.error_diagnostic topics defined in its params.yaml file.
For all the nodes, it will perform emergency handling actions accordingly and visualize the states and/or errors of these nodes.
We'd specify which nodes are supposed to run and which nodes are optional in the configuration file of this node.
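The required/optional bookkeeping could look something like the sketch below. It is plain C++ with hypothetical names (StateManager, NodeState, all_required_active). It assumes the node list has been read from the configuration file and that state changes arrive through the transition event subscription. The policy shown, that every required node must be Active, is only one possible choice.

```cpp
#include <map>
#include <string>

// State of a managed node as tracked by the manager.
// Unknown means no state has been received yet.
enum class NodeState { Unknown, Unconfigured, Inactive, Active, Finalized };

class StateManager {
public:
  // Registers a node from the configuration; `required` marks nodes that must run.
  void add_node(const std::string & name, bool required) {
    nodes_[name] = Entry{required, NodeState::Unknown};
  }

  // Called for each received transition event (or the initial get_state reply).
  // Events from nodes that are not in the configuration are ignored.
  void on_transition(const std::string & name, NodeState goal_state) {
    auto it = nodes_.find(name);
    if (it != nodes_.end()) { it->second.state = goal_state; }
  }

  // True when every required node is Active; optional nodes never affect this.
  bool all_required_active() const {
    for (const auto & kv : nodes_) {
      if (kv.second.required && kv.second.state != NodeState::Active) { return false; }
    }
    return true;
  }

private:
  struct Entry {
    bool required;
    NodeState state;
  };
  std::map<std::string, Entry> nodes_;
};
```

Error reports from the <node_name>.error_diagnostic topics could be folded into the same table, for example by storing the last reported severity next to each node's state.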
duration_callback and monitor it along with other things.