@xmfcx
Last active February 3, 2022 13:16

Autoware Diagnostics and Monitoring

Related MR on Error Monitor Design

Initial implementation of the monitored node API

In Autoware, it should be possible to:

  • Monitor the rates of certain publishers
  • Monitor the states of the nodes
  • Visualize all this information from a single place with a GUI

Also my questions on the MR:

  • Do we really need a monitored subscriber?
  • Does monitored node need a timer at all?

Do we really need a monitored subscriber?

I think the subscriber of each node shouldn't have to worry about its input frequencies. It would be too much of a hassle to set this for each specific node as a launch parameter.

Instead, if there were a centralized way of entering the expected minimum and maximum latencies of specific nodes, it would look more organized, and different configurations could be orchestrated/managed more easily.

Monitoring the rates of certain publishers

I like the idea of having monitored publishers.

In your implementation, the monitored publisher publishes the following in a latched way:

  • <topic_name>.min_publish_interval_ms
  • <topic_name>.max_publish_interval_ms
  • <topic_name>.max_callback_duration_ms (I didn't get this one)

I think the monitored publisher should do this instead:

  • The constructor of the publisher takes min_publish_interval_ms and max_publish_interval_ms as parameters.
  • The constructor also creates another, normal publisher named <topic_name>.tick_diagnostic.
  • The .tick_diagnostic message consists of a header only.
  • Every time the publish(..) method is called, <topic_name>.tick_diagnostic is published along with the intended message.
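The list above can be sketched without any ROS dependency. This is only an illustration under assumptions: `MonitoredPublisher`, `Tick`, the `send` callable and the injectable `clock` are hypothetical names standing in for a real publisher and middleware, not the API from the MR.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tick:
    """Header-only diagnostic message: a timestamp plus the source topic."""
    stamp_s: float
    topic: str


class MonitoredPublisher:
    """Framework-free stand-in for the proposed publisher (hypothetical API)."""

    def __init__(
        self,
        topic: str,
        send: Callable[[str, Any], None],
        min_publish_interval_ms: float,
        max_publish_interval_ms: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.topic = topic
        self.tick_topic = topic + ".tick_diagnostic"
        # The limits are only stored here; a monitor could be configured from them.
        self.min_publish_interval_ms = min_publish_interval_ms
        self.max_publish_interval_ms = max_publish_interval_ms
        self._send = send
        self._clock = clock

    def publish(self, msg: Any) -> None:
        # Publish the intended message first, then the lightweight tick.
        self._send(self.topic, msg)
        self._send(self.tick_topic, Tick(self._clock(), self.topic))


# Usage: the `sent` list stands in for the middleware.
sent = []
pub = MonitoredPublisher(
    "points", lambda topic, msg: sent.append((topic, msg)), 90.0, 110.0,
    clock=lambda: 12.5,
)
pub.publish("cloud")
```

After the call, `sent` holds the intended message followed by one tick on `points.tick_diagnostic`.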

Since the message will be so light, it shouldn't be much of an issue on the network traffic side.

The central state monitor could then subscribe to these *.tick_diagnostic topics (taken from an internally managed list that is set in its params.yaml).

It could check whether all these publishers are publishing at their intended rates.

If not, it could trigger the emergency handling actions.

We would also be able to monitor each publisher's rate and even visualize the rates with a GUI if we wanted to.
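A minimal sketch of the rate check such a central monitor might perform, assuming ticks arrive with a timestamp in seconds. `TickRateMonitor`, its method names and the limits dictionary are hypothetical; in practice the limits would come from the monitor's params.yaml.

```python
from typing import Dict, List, Optional, Tuple


class TickRateMonitor:
    """Checks tick intervals against per-topic (min_ms, max_ms) limits."""

    def __init__(self, limits_ms: Dict[str, Tuple[float, float]]) -> None:
        self._limits_ms = limits_ms
        self._last_stamp_s: Dict[str, float] = {}

    def on_tick(self, topic: str, stamp_s: float) -> Optional[str]:
        """Return a violation description, or None if the interval is fine."""
        min_ms, max_ms = self._limits_ms[topic]
        previous = self._last_stamp_s.get(topic)
        self._last_stamp_s[topic] = stamp_s
        if previous is None:
            return None  # First tick: no interval to judge yet.
        interval_ms = (stamp_s - previous) * 1000.0
        if interval_ms < min_ms:
            return f"{topic}: interval {interval_ms:.1f} ms below minimum {min_ms} ms"
        if interval_ms > max_ms:
            return f"{topic}: interval {interval_ms:.1f} ms above maximum {max_ms} ms"
        return None

    def stale_topics(self, now_s: float) -> List[str]:
        """Topics never seen, or silent for longer than their maximum interval."""
        stale = []
        for topic, (_, max_ms) in self._limits_ms.items():
            last = self._last_stamp_s.get(topic)
            if last is None or (now_s - last) * 1000.0 > max_ms:
                stale.append(topic)
        return stale


monitor = TickRateMonitor({"points.tick_diagnostic": (50.0, 150.0)})
```

`on_tick` only fires when a tick arrives, so a publisher that stops completely would go unnoticed; `stale_topics` is meant to be polled periodically to catch that case.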

Monitoring the states of the nodes

http://design.ros2.org/articles/node_lifecycle.html

First of all, all nodes in Autoware should inherit from rclcpp_lifecycle::LifecycleNode

and have the following primary states:

  • Unconfigured
  • Inactive
  • Active
  • Finalized

and the following intermediate states:

  • Configuring
  • CleaningUp
  • ShuttingDown
  • Activating
  • Deactivating
  • ErrorProcessing
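How the primary and intermediate states relate can be sketched as a transition table. This is a simplified illustration of the success paths described in the linked design article; the Python names are hypothetical, and the ErrorProcessing path (taken when a transition fails or raises) is deliberately left out.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    UNCONFIGURED = "Unconfigured"
    INACTIVE = "Inactive"
    ACTIVE = "Active"
    FINALIZED = "Finalized"


# (current primary state, requested transition)
#   -> (intermediate state passed through, primary state reached on success)
TRANSITIONS: Dict[Tuple[State, str], Tuple[str, State]] = {
    (State.UNCONFIGURED, "configure"): ("Configuring", State.INACTIVE),
    (State.INACTIVE, "activate"): ("Activating", State.ACTIVE),
    (State.ACTIVE, "deactivate"): ("Deactivating", State.INACTIVE),
    (State.INACTIVE, "cleanup"): ("CleaningUp", State.UNCONFIGURED),
    (State.UNCONFIGURED, "shutdown"): ("ShuttingDown", State.FINALIZED),
    (State.INACTIVE, "shutdown"): ("ShuttingDown", State.FINALIZED),
    (State.ACTIVE, "shutdown"): ("ShuttingDown", State.FINALIZED),
}


def apply_transition(state: State, transition: str) -> State:
    """Return the next primary state, or raise if the transition is not allowed."""
    try:
        _, target = TRANSITIONS[(state, transition)]
    except KeyError:
        raise ValueError(f"'{transition}' is not valid from {state.value}") from None
    return target


# Usage: bring a node from Unconfigured to Active.
state = State.UNCONFIGURED
for step in ("configure", "activate"):
    state = apply_transition(state, step)
```

Finalized has no outgoing entries in the table, which matches it being a terminal state.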

At the end of that document it says:

Extensions: This lifecycle will be required to be supported throughout the toolchain as such this design is not intended to be extended with additional states. It is expected that there will be more complicated application specific state machines. They may exist inside of any lifecycle state or at the macro level these lifecycle states are expected to be useful primitives as part of a supervisory system.

That means we should manage any further states separately from this state machine.

I thought a lot about adding more states, but I think these should be enough.

I'd suggest we use register_on_error to publish the error severity to the Autoware State Manager on a topic like <node_name>.error_diagnostic, using a custom message such as autoware_auto_system_msgs/msg/HazardStatus.idl.

Autoware State Manager

This manager will subscribe to lifecycle_msgs::msg::TransitionEvent messages, like in here, and be notified of state changes.

It will also call the <node_name>__get_state service once at startup to learn each node's initial state (explained in detail in https://index.ros.org/p/lifecycle/ ).

This node will also subscribe to <node_name>.error_diagnostic topics defined in its params.yaml file.

For all the nodes, it will perform emergency handling actions accordingly and visualize the states and/or errors of these nodes.

We'd specify which nodes are supposed to run and which nodes are optional in the configuration file of this node.
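One way the required/optional distinction could drive the emergency decision is sketched below. The function name, the boolean config layout, the node names and the rule "a required node must be active" are all assumptions for illustration; the actual policy is not defined here.

```python
from typing import Dict, List


def nodes_needing_emergency(
    required: Dict[str, bool], states: Dict[str, str]
) -> List[str]:
    """Return required nodes that are not currently reported as 'active'.

    `required` maps node name -> True (must run) or False (optional).
    `states` maps node name -> last known lifecycle state; a missing
    entry means the manager has not heard from that node at all.
    """
    return sorted(
        name
        for name, is_required in required.items()
        if is_required and states.get(name) != "active"
    )


# Usage with hypothetical node names.
config = {"lidar_driver": True, "localizer": True, "debug_viewer": False}
observed = {"lidar_driver": "active", "localizer": "inactive"}
failing = nodes_needing_emergency(config, observed)
```

Here `debug_viewer` is absent but optional, so only `localizer` is returned.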

@xmfcx
Author

xmfcx commented Jan 24, 2022

  • Have a single node for diagnostic monitoring of topics and nodes.
  • Each topic is associated with the corresponding publishing node in the configuration.
  • From each monitored subscriber, publish a duration_callback and monitor it along with other things.
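Measuring the callback duration on the subscriber side could look roughly like this. `timed_callback` and `report_duration_ms` are hypothetical names; in a real node the report function would publish the value on the duration_callback topic.

```python
import time
from typing import Any, Callable


def timed_callback(
    callback: Callable[[Any], Any],
    report_duration_ms: Callable[[float], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Callable[[Any], Any]:
    """Wrap a subscriber callback so that its run time is reported in ms."""

    def wrapper(msg: Any) -> Any:
        start = clock()
        try:
            return callback(msg)
        finally:
            # Report even if the callback raised, so failures are still timed.
            report_duration_ms((clock() - start) * 1000.0)

    return wrapper


# Usage: collect durations in a list instead of publishing them.
durations_ms = []
on_message = timed_callback(lambda msg: len(msg), durations_ms.append)
on_message("hello")
```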

@xmfcx
Author

xmfcx commented Jan 24, 2022

https://gitlab.com/autowarefoundation/autoware.auto/AutowareAuto/-/issues/282

Old discussion: Implementing LifeCycleNode Support on Autoware.Auto Nodes

@xmfcx
Author

xmfcx commented Jan 24, 2022

@xmfcx
Author

xmfcx commented Jan 24, 2022

  • Service calls should be monitored by the method that is performing the service call.
  • If the response is delayed, the method/node should switch to an error state and publish a diagnostics message so that the central error monitor can report it to the user.
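The caller-side check described in these two points might be sketched as follows, using a worker thread and a deadline. `call_with_deadline` and `report` are hypothetical names; `report` stands in for switching to an error state and publishing a diagnostics message.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional


def call_with_deadline(
    call: Callable[[], Any],
    timeout_s: float,
    report: Callable[[str], None],
) -> Optional[Any]:
    """Run `call` in a worker thread; report and return None if it is late."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(call)
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        report(f"service response not received within {timeout_s} s")
        return None
    finally:
        # Do not block on the late call; the worker finishes in the background.
        pool.shutdown(wait=False)


# Usage: one fast call and one that misses its deadline.
reports = []
fast = call_with_deadline(lambda: "ok", 1.0, reports.append)
slow = call_with_deadline(lambda: time.sleep(0.3), 0.05, reports.append)
```

The late call is not cancelled in this sketch; it only stops being waited on.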

@kenji-miyake

  • Monitor the rates of certain publishers
  • Monitor the states of the nodes
  • Visualize all this information from a single place with a GUI

These are all possible with our proposal error_monitor.
Regarding the states, we should clarify what the states are, for example, just alive/dead or including something else.

I think the subscriber of each node shouldn't have to worry about its input frequencies. It would be too much of a hassle to set this for each specific node as a launch parameter.
Instead, if there were a centralized way of entering the expected minimum and maximum latencies of specific nodes, it would look more organized, and different configurations could be orchestrated/managed more easily.

I agree, we need a centralized way.

Since the message will be so light, it shouldn't be much of an issue on the network traffic side.

I feel it's not perfect since sometimes it might happen that the publisher published tick_diagnostics but the subscriber cannot receive the core message.
So I believe checking on the subscriber's side is necessary in any case.

First of all, all nodes in Autoware should inherit from rclcpp_lifecycle::LifecycleNode

In the long term, I agree.
But for prototypes on Universe, I think we shouldn't force it from the beginning.
Before that, we need to get familiar with the feature.

I'd suggest we use register_on_error to publish the error severity to the Autoware State Manager on a topic like <node_name>.error_diagnostic, using a custom message such as autoware_auto_system_msgs/msg/HazardStatus.idl.

I generally agree, but we should discuss the design more.

@xmfcx
Author

xmfcx commented Jan 24, 2022

Regarding the states, we should clarify what the states are, for example, just alive/dead or including something else.

These come from the rclcpp_lifecycle::LifecycleNode ( http://design.ros2.org/articles/node_lifecycle.html )

Unconfigured: Just started
Inactive: Parameters are set but won't do any processing (maybe the executor doesn't run yet)
Active: Actively working and processing
Finalized: Dead gracefully or with error, no way back

And there are the transition states which should cover all that we need.

@xmfcx
Author

xmfcx commented Jan 24, 2022

These are all possible with our proposal error_monitor.

The current error monitor doesn't work with the http://design.ros2.org/articles/node_lifecycle.html

There is no rosnode kill -a in ROS2 unless we conform to this generic interface. I think it should become clearer once we create a draft package and showcase its utility.

@xmfcx
Author

xmfcx commented Jan 24, 2022

I feel it's not perfect since sometimes it might happen that the publisher published tick_diagnostics but the subscriber cannot receive the core message.

When can this happen?

So I believe checking on the subscriber's side is necessary in any case.

I don't know how that is possible so I can't comment on this yet.

@xmfcx
Author

xmfcx commented Jan 24, 2022

In the long term, I agree.
But for prototypes on Universe, I think we shouldn't force it from the beginning.
Before that, we need to get familiar with the feature.

We should try to make it as simple to implement as possible. Then adding it to the existing nodes would be trivial.

I am also not familiar with the feature, so we are in the same boat :)

@xmfcx
Author

xmfcx commented Jan 24, 2022

In a broader design sense, I think the biggest benefit of managed nodes is that they can be monitored from the ROS2 toolset and are part of the framework. They are already out there. We should just familiarize ourselves with them, simplify, and implement. We don't need to re-engineer this from scratch.

@kenji-miyake

These come from the rclcpp_lifecycle::LifecycleNode ( http://design.ros2.org/articles/node_lifecycle.html )

I see, thank you.

Current error monitor doesn't work with the http://design.ros2.org/articles/node_lifecycle.html

Yes, but I think lifecycle is an optional part that can integrate into the current error_monitor framework. I guess it's like just adding some diagnostics related to node statuses.

When can this happen?

For example, regardless of whether it can happen with the current system, if we change the middleware's settings to "it can send up to 1MB per message" and we send tick_diagnostics (assuming 1KB) and an image (assuming 2MB), the subscriber will receive only tick_diagnostics and the image will be dropped.

Although it's an abnormal case, I think we can't say it never happens, for example due to network congestion or something like that.
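The scenario can be made concrete with a toy model of a size-limited transport. The sizes are the ones assumed in the example above; `deliver` and the topic names are hypothetical and do not model any real middleware.

```python
from typing import List, Tuple

MAX_MESSAGE_BYTES = 1_000_000  # assumed middleware limit ("up to 1MB per message")


def deliver(messages: List[Tuple[str, int]], max_bytes: int) -> List[str]:
    """Return the topics whose messages fit within the transport's size limit."""
    return [topic for topic, size in messages if size <= max_bytes]


# One publish() call sends the 2MB image and its 1KB tick.
sent = [("image", 2_000_000), ("image.tick_diagnostic", 1_000)]
received = deliver(sent, MAX_MESSAGE_BYTES)

tick_monitor_ok = "image.tick_diagnostic" in received  # what a tick-only monitor sees
subscriber_got_image = "image" in received             # what the subscriber actually got
```

In this model the tick-only monitor considers the publisher healthy while the subscriber never receives the image, which is the gap a subscriber-side check would close.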

I don't know how that is possible so I can't comment on this yet.

Hmm? I feel it's possible for example with a customized subscriber class.

@kenji-miyake

I think this Discourse thread is related.
https://discourse.ros.org/t/add-heartbeat-message-type/24162
