Presentation on experiences with ros2

ROS2 - October 2018 - March 2019

There and back again

Started using ROS2 toward the end of the bouncy release, upgraded to crystal shortly after it came out, and have now returned to ros1.

Maybe will try again when Dashing is released (June 2019?), or will wait for certain key features (some of these may already have been achieved, or were already achieved while I was using it but the right way wasn't clear to me):

  • ros2 topic echo /image_raw shouldn't take 100% cpu, rqt_image_view shouldn't take 50% cpu. This may be fixed by https://github.com/ros2/rosidl_python/pull/35
  • ros2 node info should handle namespaces
  • dead nodes (and their topics) should disappear when their processes die, not remain listed in ros2 node/topic list. Shouldn't have to kill the ros2 daemon to clear them out.
  • ros2 node list shouldn't show multiple entries for the same named node (or were some of those dead nodes from previous runs?)
  • Composition - ability to load nodes and parameters into any namespace.
  • Launch support for composition. Expect launch to look very different when I revisit ros2 (maybe they'll have brought back xml); don't want to continually relearn and rewrite launch files.

Longer term issues that may not be addressed any time soon:

  • python nodes seem very cpu hungry - they were already on the high side in ros1, and now seem much worse.
  • CPU usage even in C++ nodes seems high in many situations (TODO apples-to-apples comparisons)
  • Responsiveness of localhost systems in general, including the command line tools
  • sourcing setup.bash takes a long time, avoid sourcing it in .bashrc
  • localhost inter-process communications - dds doesn't seem to be very good at this vs. TCP.
  • colcon build seems slower (TODO need to test that) and clunkier than catkin build: compare catkin build foo (then rebuild and run) vs. colcon build --mixin my_release --packages-select foo

going off the rails

Ran into performance issues with megapixel image pipelines at modest framerates. See lucasw/ros2_cpp_py#3 and https://answers.ros.org/question/312964/ros2-megapixel-image-pubsub-cpu-usage-is-very-high/

There is some capability analogous to ros1 nodelets: publish a unique_ptr rather than a shared_ptr, and the message isn't copied if the subscribing node is in the same process. Also, some non-default dds implementations have a shared memory capability? Didn't investigate either too much - the unique_ptr path is still limited to a 1:1 pipeline, one subscriber for one publisher.
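A rough sketch of what the unique_ptr path looks like (written against the post-crystal rclcpp API from memory, so the exact signatures are illustrative rather than authoritative):

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

// Publish a unique_ptr so a subscriber in the same process can take ownership
// of the message without a copy (only works 1:1, as noted above).
class Producer : public rclcpp::Node
{
public:
  Producer()
  : Node("producer", rclcpp::NodeOptions().use_intra_process_comms(true))
  {
    pub_ = create_publisher<std_msgs::msg::String>("chatter", 10);
    timer_ = create_wall_timer(std::chrono::milliseconds(100), [this]() {
      auto msg = std::make_unique<std_msgs::msg::String>();
      msg->data = "hello";
      // ownership is handed to the middleware - don't touch msg after this
      pub_->publish(std::move(msg));
    });
  }

private:
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<Producer>());
  rclcpp::shutdown();
  return 0;
}
```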

Made my own internal publish/subscribe system to bypass ros2 publishing, with a relatively easy way to toggle ros2 publishing back on for certain topics when desired. This worked fine for a single node, then started looking into composition.

Had to subclass rclcpp::Node and convert all the nodes that dealt with images to use this internal pub/sub.
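Roughly the shape of that internal pub/sub, as a hypothetical minimal sketch (not the actual code; a real version needs locking, unsubscribe, and so on):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical in-process pub/sub registry: every subscriber gets the same
// shared_ptr the publisher hands in, so megapixel images are never copied
// or serialized.
template <typename MsgT>
class InternalPubSub
{
public:
  using MsgConstPtr = std::shared_ptr<const MsgT>;
  using Callback = std::function<void(const MsgConstPtr &)>;

  // subscribers register a callback against a topic name
  void subscribe(const std::string & topic, Callback cb)
  {
    subs_[topic].push_back(std::move(cb));
  }

  // publishing hands the same pointer to every registered callback
  void publish(const std::string & topic, const MsgConstPtr & msg)
  {
    for (auto & cb : subs_[topic]) {
      cb(msg);
    }
  }

private:
  std::map<std::string, std::vector<Callback>> subs_;
};
```

The node subclass holds a shared instance of something like this, and can optionally also create a real ros2 publisher for the same topic when the toggle is on.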

Then had to fork rclcpp::Node so that namespaces, parameters, and remappings could still be used.

It was becoming unclear how much of ros2 was still being used - services (though those sometimes seemed unreliable? there's an rtps bug about that), and rviz somewhat, but I would have had to fork rviz too to avoid its huge cpu usage. (Though there are many situations in ros1 where rviz cpu usage seems higher than it ought to be.)

Ran into a bug involving the forked rclcpp support for namespaces in composed nodes - support is non-existent in that realm - so re-evaluated the past few months of experience and went back to ros1.

other

6 month intensive release cycle:

  • The people most knowledgeable about the current release are already far beyond it, deep in development of the next release.
  • Not a lot of time left for ros2 questions on answers.ros.org.
  • Lots of breakage between releases - better to wait for stability.
  • Time spent on workarounds for the current version is quickly made redundant if the issue is fixed at the source.
  • New OSRF staff are taking on workload from the founders, with some learning curve there. (Is force pushing into a pull request standard practice anywhere? It can be disabled in github; make a backup of any branch you want to be able to continue using, and use the backup in the pr.)

Presentations from companies like Cruise Automation (worth billions of dollars and with hundreds of engineers) cherry-picking what they want from ros2 and rewriting all the rest (TODO link to the roscon 2018 video, verify it was Cruise and that they were using ros2) aren't encouraging for those of us with much more limited resources.

Big companies (Amazon, Microsoft, Google) are funding ros2, but it's not clear whether they are actually using it.

image performance issues with ros1

https://answers.ros.org/question/232919/image-related-nodes-eating-up-majority-of-cpu-until-system-becomes-unresponsive/

Whatever that was, it seemed to go away.

https://answers.ros.org/question/219510/ros-traffic-over-gigabit-connection-is-renegotiated-to-100-mbps/

I believe this was exactly the sort of problem ros2 and dds were meant to address - an underperforming network link.

But I don't want to compromise all the local traffic because of that one link; I just want better handling of the weak link - dds where I want it, localhost TCP where I don't, and transparent node composition to outdo localhost TCP.

DDS

The issue with lossy networks and ROS 1 was that it used TCP almost exclusively, and if you lost data, TCP would try to resend it, which would further stress the network and you could end up saturating the network and not even keeping up at all. Especially since the common use case for this was streaming some sensor data over wifi to a workstation to visualize it in rviz, in which case you don't care if you miss a few messages. ROS 1 does have a UDP transport, but it had several issues, for example being unreliable for large data and not being supported uniformly (python never supported it).

DDS has unreliable and reliable communication and graceful degradation, i.e. a reliable publisher can send data to an unreliable subscriber (but not the other way around). But more importantly, DDS's reliable communication happens over UDP with a custom protocol on top (DDSI-RTPS), which has the advantage over TCP that you can control things like how long it will retry to send data, how long it will wait for a NAK, how it will buffer data before sending (like Nagle's algorithm), etc...

Basically, the idea is that DDS's configuration options allow it to be many things between TCP and simple UDP, including a more flexible version of TCP, which in turn allows you to fine tune your communication settings to better work on lossy networks.

This comes at the cost of complexity and some performance (TCP on the local host is really good), but should allow knowledgeable users to get good results in more situations.

https://answers.ros.org/question/319218/how-does-ros2-implement-its-network-design/
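For reference, this is roughly how those reliability choices surface in rclcpp (a sketch against the Dashing-era API; earlier releases pass an rmw_qos_profile_t such as rmw_qos_profile_sensor_data instead):

```cpp
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("image_listener");

  // Best-effort, small queue: dropped frames over a lossy link are acceptable,
  // retransmission storms are not.
  auto sub = node->create_subscription<sensor_msgs::msg::Image>(
    "image_raw", rclcpp::SensorDataQoS(),
    [](sensor_msgs::msg::Image::SharedPtr msg) {
      RCLCPP_INFO(rclcpp::get_logger("image_listener"),
        "got %ux%u image", msg->width, msg->height);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```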


lucasw commented Mar 24, 2019

Imgui instead of qt

There are some other interesting guis- nuklear and others in the 'single header file library' school.

Personally don't like using qt (aesthetic mismatch), so trying Imgui https://github.com/ocornut/imgui - it 'feels like programming' and doesn't take the kitchen-sink approach. Some inefficiency when nothing is happening, since the gui is redrawn every frame.

It's a single-developer project (though maybe with enough Patreon supporters, or more usage...), which is inspirational if you are ever a lone developer, but not as appealing if you are an institution that prefers institution-driven solutions.

It's heavily oriented toward developer-facing guis rather than customer-facing ones, but everyone uses web tools for the latter anyhow - the qt ros tools exist for the developers using ros.
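A taste of the immediate-mode style, condensed from imgui's stock GLFW + OpenGL3 example (backend headers and init calls vary between imgui versions):

```cpp
#include <GLFW/glfw3.h>

#include "imgui.h"
#include "imgui_impl_glfw.h"
#include "imgui_impl_opengl3.h"

int main()
{
  glfwInit();
  GLFWwindow * window = glfwCreateWindow(640, 480, "demo", nullptr, nullptr);
  glfwMakeContextCurrent(window);

  ImGui::CreateContext();
  ImGui_ImplGlfw_InitForOpenGL(window, true);
  ImGui_ImplOpenGL3_Init("#version 130");

  float gain = 1.0f;
  while (!glfwWindowShouldClose(window)) {
    glfwPollEvents();
    ImGui_ImplOpenGL3_NewFrame();
    ImGui_ImplGlfw_NewFrame();
    ImGui::NewFrame();

    // the entire gui is re-declared every frame - 'feels like programming'
    ImGui::Begin("tuning");
    ImGui::SliderFloat("gain", &gain, 0.0f, 10.0f);
    if (ImGui::Button("reset")) {
      gain = 1.0f;
    }
    ImGui::End();

    ImGui::Render();
    glClear(GL_COLOR_BUFFER_BIT);
    ImGui_ImplOpenGL3_RenderDrawData(ImGui::GetDrawData());
    glfwSwapBuffers(window);
  }

  ImGui_ImplOpenGL3_Shutdown();
  ImGui_ImplGlfw_Shutdown();
  ImGui::DestroyContext();
  glfwDestroyWindow(window);
  glfwTerminate();
  return 0;
}
```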


lucasw commented Mar 25, 2019

ROS nodes and topics on a GPU?

Also using OpenGL and glsl a lot here.
If an image is mostly getting processed on one system and is going to end up being copied to the gpu for display anyway, why not copy it there earlier and do the entire image processing pipeline there?
Would probably want to use OpenCL though, and would need a means of moving data from OpenCL to OpenGL/glsl without copying back to the cpu.

Haven't pursued this - maybe will wait for Vulkan, which means waiting until I have a laptop with a vulkan-friendly gpu.

DMA bytes directly from ethernet nic to graphics memory?

Nvidia is probably already there with their robotics middleware, but with all the typical problems of hardware-vendor-driven software - long-term commitment may be lacking, and adoption is limited.


lucasw commented Mar 26, 2019

DDS

... Since DDS is implemented, by default, on UDP, it does not depend on a reliable transport or hardware for communication. This means that DDS has to reinvent the reliability wheel (basically TCP plus or minus some features), but in exchange DDS gains portability and control over the behavior. Control over several parameters of reliability, what DDS calls Quality of Service (QoS), gives maximum flexibility in controlling the behavior of communication. For example, if you are concerned about latency, like for soft real-time, you can basically tune DDS to be just a UDP blaster. In another scenario you might need something that behaves like TCP, but needs to be more tolerant to long dropouts, and with DDS all of these things can be controlled by changing the QoS parameters.

Though the default implementation of DDS is over UDP, and only requires that level of functionality from the transport, OMG also added support for DDS over TCP in version 1.2 of their specification. Only looking briefly, two of the vendors (RTI and PrismTech) both support DDS over TCP.

https://design.ros2.org/articles/ros_on_dds.html

Also look at:

http://community.rti.com/docs/html/tcp_transport/main.html#configuring

Efficient Transport Alternatives
In ROS 1.x there was never a standard shared-memory transport because it is negligibly faster than localhost TCP loop-back connections. It is possible to get non-trivial performance improvements from carefully doing zero-copy style shared-memory between processes, but anytime a task required faster than localhost TCP in ROS 1.x, nodelets were used. Nodelets allow publishers and subscribers to share data by passing around boost::shared_ptrs to messages. This intraprocess communication is almost certainly faster than any interprocess communication options and is orthogonal to the discussion of the network publish-subscribe implementation.

In the context of DDS, most vendors will optimize message traffic (even between processes) using shared-memory in a transparent way, only using the wire protocol and UDP sockets when leaving the localhost. This provides a considerable performance increase for DDS, whereas it did not for ROS 1.x, because the localhost networking optimization happens at the call to send. For ROS 1.x the process was: serialize the message into one large buffer, call TCP’s send on the buffer once. For DDS the process would be more like: serialize the message, break the message into potentially many UDP packets, call UDP’s send many times. In this way sending many UDP datagrams does not benefit from the same speed up as one large TCP send. Therefore, many DDS vendors will short circuit this process for localhost messages and use a blackboard style shared-memory mechanism to communicate efficiently between processes.
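For comparison, the boost::shared_ptr passing mentioned above looks something like this in ros1 (a plain node is shown for brevity; inside a nodelet manager the same publish() hands the pointer to in-process subscribers with no copy or serialization):

```cpp
#include <boost/make_shared.hpp>
#include <ros/ros.h>
#include <sensor_msgs/Image.h>

int main(int argc, char ** argv)
{
  ros::init(argc, argv, "image_source");
  ros::NodeHandle nh;
  ros::Publisher pub = nh.advertise<sensor_msgs::Image>("image_raw", 2);

  ros::Rate rate(30.0);
  while (ros::ok()) {
    // allocate the message as a shared_ptr and never touch it after publish() -
    // subscribers in the same nodelet manager may receive this exact pointer
    sensor_msgs::ImagePtr msg = boost::make_shared<sensor_msgs::Image>();
    msg->header.stamp = ros::Time::now();
    msg->width = 1920;
    msg->height = 1080;
    msg->encoding = "rgb8";
    msg->step = msg->width * 3;
    msg->data.resize(msg->step * msg->height);
    pub.publish(msg);

    ros::spinOnce();
    rate.sleep();
  }
  return 0;
}
```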

'most vendors will optimize message traffic', but not the ros2 default fastrtps?

Tried out connext (which performed much worse) and opensplice (about the same as fastrtps), but saw no automatic transparent shared memory or any other apparent performance gains.

However, not all DDS vendors are the same in this respect, so ROS would not rely on this “intelligent” behavior for efficient intraprocess communication. Additionally, if the ROS message format is kept, which is discussed in the next section, it would not be possible to prevent a conversion to the DDS message type for intraprocess topics. Therefore a custom intraprocess communication system would need to be developed for ROS which would never serialize nor convert messages, but instead would pass pointers (to shared in-process memory) between publishers and subscribers using DDS topics. This same intraprocess communication mechanism would be needed for a custom middleware built on ZeroMQ, for example.

The point to take away here is that efficient intraprocess communication will be addressed regardless of the network/interprocess implementation of the middleware.

Addressing only intraprocess leaves out command line tools like ros2 topic echo, the rqt gui tools, and rviz - those seem to suffer greatly from having to use the default dds with default settings, so efficient interprocess communication needs to be on an equal footing with intraprocess.
