machinekoder/gist:3ba0e8a7172c0804bc3e68e25ec49bed

## gistfile1.md

      
MQTT Retain Flag and Potential Problems

This document summarizes my findings on potential problems with using the MQTT retain flag as a default for every topic in an MQTT application. MQTT retain is a useful feature if used properly. However, many guides rarely warn about the risks of using it (for developers and for users of the software). The potential risks tend to surface only after working with IoT applications for some time, reading through user forums and the MQTT specs, and gaining a bit of experience with other middleware projects. I am writing this document mainly for myself, for future implementations of MQTT applications, so I can make better decisions about when to use and when not to use the MQTT retain flag.
Note that I'm only considering the MQTT 5 specs here. This document is not a write-up in prose (except for this section, maybe) and should be used more as a sheet of notes with links for making your own decisions. It is by no means complete, and comments that help me improve it are welcome.
Expected Behavior

The MQTT 5 specification is very approachable. The section about the RETAIN flag can be found here: https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901104

Good tutorial/explanation of different QoS settings in combination with retain:

http://www.steves-internet-guide.com/mqtt-retained-messages-example/
also contains a good explanation of where the retain flag can be used properly


Intended Use Cases

Citing from the MQTT 5 specification (https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html):
"Retained messages are useful where publishers send state messages on an irregular basis. A new non-shared subscriber will receive the most recent state."

Status topics with infrequent updates

e.g. announcement, IP addresses, configuration, ...
low-power clients that come online only infrequently but whose status usually remains valid (e.g. a door sensor)


storing settings across sessions

not intended by the MQTT specs, but done by some applications


using the retain flag is useful for LWT (Last Will and Testament) messages to indicate the online status of a client

but LWT also works without the retain flag set (see [MQTT-3.1.2-14])
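
A minimal sketch of the LWT use case, assuming a paho-mqtt-style client object that exposes `will_set` and `publish` with `topic`, `payload`, `qos` and `retain` arguments. The topic layout `devices/<id>/online`, the payload strings and the function names are my own assumptions for illustration.

```python
def presence_topic(device_id: str) -> str:
    # Hypothetical topic layout, chosen only for this example.
    return f"devices/{device_id}/online"


def register_last_will(client, device_id: str) -> str:
    """Call before connecting: the broker publishes this will if the client
    disconnects ungracefully."""
    topic = presence_topic(device_id)
    client.will_set(topic, payload="false", qos=1, retain=True)
    return topic


def announce_online(client, device_id: str) -> str:
    """Call once the connection is established (e.g. from an on_connect callback)."""
    topic = presence_topic(device_id)
    client.publish(topic, payload="true", qos=1, retain=True)
    return topic
```

With `retain=True` on both messages a late subscriber learns the last known online state. With `retain=False` on the will, only clients subscribed at the moment of the disconnect are notified.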


Potential Problems


topic space pollution

especially when paired with message persistence on the broker, as retained messages without an expiry interval are never deleted on their own


new messages without the retain flag don't remove the retained message from the server (see [MQTT-3.3.1-8]); only a retained publish with an empty (zero-byte) payload removes it [MQTT-3.3.1-6]

which means clients still receive the (outdated) retained message when they subscribe again, e.g. after a reconnect
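
To make the two rules concrete, here is a toy in-memory model of how a broker keeps track of retained messages. It is not a real broker: `ToyBroker` is a made-up class that only handles exact topic matches and ignores QoS, sessions and wildcards.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, bytes, bool], None]  # (topic, payload, retained)


class ToyBroker:
    """Toy model of retained-message bookkeeping (exact topic match only)."""

    def __init__(self) -> None:
        self.retained: Dict[str, bytes] = {}
        self.subscribers: Dict[str, List[Callback]] = {}

    def publish(self, topic: str, payload: bytes, retain: bool = False) -> None:
        if retain:
            if payload:
                self.retained[topic] = payload   # store or replace
            else:
                self.retained.pop(topic, None)   # empty payload clears
        # retain=False leaves the retained store untouched.
        for callback in self.subscribers.get(topic, []):
            callback(topic, payload, False)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self.subscribers.setdefault(topic, []).append(callback)
        if topic in self.retained:
            # A new subscriber immediately gets the stored message.
            callback(topic, self.retained[topic], True)


broker = ToyBroker()
broker.publish("lamp/state", b"on", retain=True)
broker.publish("lamp/state", b"off")  # newer, but not retained

seen = []
broker.subscribe("lamp/state", lambda t, p, r: seen.append((p, r)))
# seen == [(b"on", True)]: the subscriber gets the outdated "on"

broker.publish("lamp/state", b"", retain=True)  # clears the retained message
```

After the last call the toy store is empty again, so a subscriber arriving later receives nothing until the next publish.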


client/topic user may not be aware of outdated messages

for example if no LWT or heartbeat topic is used and the client state is unknown to the topic user
autodiscovery messages/topics for Home Assistant (hass) and Homie may linger and cause wrong autodiscovery
messages don't contain a timestamp, so the age of a message is unknown

can be fixed with an additional timestamp topic
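
A rough sketch of the timestamp-topic idea, assuming publisher and subscriber clocks are reasonably synchronized. The `<topic>/timestamp` naming, the Unix-seconds payload format and both function names are assumptions of mine, not an MQTT convention.

```python
import time
from typing import List, Optional, Tuple


def with_timestamp(topic: str, value: str,
                   now: Optional[float] = None) -> List[Tuple[str, str]]:
    """Return the (topic, payload) pairs to publish: the value plus a
    companion timestamp in Unix seconds."""
    ts = time.time() if now is None else now
    return [(topic, value), (f"{topic}/timestamp", f"{ts:.3f}")]


def is_stale(timestamp_payload: str, max_age_s: float,
             now: Optional[float] = None) -> bool:
    """True if the companion timestamp is older than max_age_s or unparsable."""
    current = time.time() if now is None else now
    try:
        published = float(timestamp_payload)
    except ValueError:
        return True
    return current - published > max_age_s
```

Note that the two topics are not updated atomically, so a subscriber may briefly see a new value with an old timestamp or the other way around.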


ghost switching if used incorrectly (e.g. a stale retained command is replayed to a device when it reconnects)

especially problematic for users who are not MQTT experts
many examples and references in user forums

https://community.home-assistant.io/t/problem-with-mqtt-retain/30158
https://community.home-assistant.io/t/shelly-1-connecting-ghost-switching/98439


potential risk on the implementation side of mixing publishes with and without the retain flag on the same topic, e.g. the paho Python module sets the retain flag on a per-message level
Retained messages published with QoS 0 (the default in many libraries) may be discarded by the broker at any time. This means that if the message isn't re-sent at a regular interval, a new subscriber to a topic might not receive an initial value until the publisher publishes again.

citing from the MQTT specification:
"If the Server receives a PUBLISH packet with the RETAIN flag set to 1, and QoS 0 it SHOULD store the new QoS 0 message as the new retained message for that topic, but MAY choose to discard it at any time. If this happens there will be no retained message for that topic."


(Retained messages cannot provide historic values, even with QoS 2; only the most recent value per topic is returned)
(potential security risk as mentioned here https://www.txone.com/blog/mqtt-series-2-potential-risks-of-exposed-mqtt-brokers/ although the claims are questionable)
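
One way to reduce the implementation-side risks from the list above (mixing retained and non-retained publishes, retained messages sent with QoS 0) is to decide retain and QoS per topic in a single place. The sketch below assumes a publish callable that accepts `qos` and `retain` keyword arguments, as paho's `Client.publish` does. `RetainPolicy` itself is a hypothetical helper, not part of any library.

```python
from typing import Callable, Dict


class RetainPolicy:
    """Decide retain/QoS per topic centrally instead of at every call site."""

    def __init__(self, publish: Callable[..., object],
                 retained_topics: Dict[str, int]) -> None:
        # retained_topics maps topic -> QoS; unlisted topics are never retained.
        for topic, qos in retained_topics.items():
            if qos < 1:
                # QoS 0 retained messages may be discarded by the broker.
                raise ValueError(f"{topic}: retained topics should use QoS >= 1")
        self._publish = publish
        self._retained = dict(retained_topics)

    def publish(self, topic: str, payload: str, qos: int = 0):
        if topic in self._retained:
            return self._publish(topic, payload,
                                 qos=self._retained[topic], retain=True)
        return self._publish(topic, payload, qos=qos, retain=False)
```

With paho this could be wired up as `RetainPolicy(client.publish, {"device/config": 1})`, where the topic name is only an example.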

Reducing the Risk of Using Retained Messages in User-Facing Applications

In short: consumer/user-facing products don't use retained messages by default, but give expert users the option to enable them.
Examples:

Shelly MQTT guide

https://shelly-api-docs.shelly.cloud/gen1/#mqtt-configuration
Citing from the docs:
"Default LWT topic and message are shellies/<shellymodel>-<deviceid>/online, false. If these are not set after a firmware upgrade -- perform a factory reset of the device. The LWT topic is retained on user configuration (if the Retain flag is set). However, we do not recommend using retained MQTT messages."


Tasmota guide

https://tasmota.github.io/docs/MQTT/#retained-mqtt-messages
Explains problems that can be caused by retained messages and how to clean them up.


Cleaning up messages


remove with an empty (zero-byte) retained message
restart the broker, if persistence is not enabled
MQTT 5: set a message expiry interval (https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901112), as mentioned in https://dev.to/emqx/the-beginners-guide-to-mqtt-retained-messages-2no3
manual cleanup (as explained by Tasmota docs for example)

The MQTT specs don't define who is responsible for cleaning up unused retained messages. However, it would be good practice to at least let users know how to do this by hand when it is not done automatically (see the Tasmota guide on retained MQTT messages for a good example).
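
Manual cleanup with empty messages can be scripted. A minimal sketch, again assuming a publish callable with paho-style `payload`, `qos` and `retain` keyword arguments; `clear_retained` is a hypothetical helper name.

```python
from typing import Callable, Iterable, List


def clear_retained(publish: Callable[..., object],
                   topics: Iterable[str]) -> List[str]:
    """Send an empty retained message to each topic; return the topics handled."""
    cleared = []
    for topic in topics:
        # A zero-byte payload with retain=True asks the broker to drop
        # the retained message stored for this topic.
        publish(topic, payload=b"", qos=1, retain=True)
        cleared.append(topic)
    return cleared
```

The list of topics has to come from somewhere else, for example from subscribing to a wildcard and collecting every message that arrives with the retain flag set.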
Similar Features in other Middleware Frameworks

Other middleware frameworks try to solve the problem of getting the most recent value of a topic in different but similar ways. Note that the mentioned frameworks are P2P and therefore don't store anything on a broker.

ROS1: latching topic (https://answers.ros.org/answers/360900/revisions/; not the official guide, but it has a good example)
ROS2/DDS: QoS durability setting "transient local" (https://docs.ros.org/en/rolling/Concepts/About-Quality-of-Service-Settings.html)

Note that mapping this behavior into other middleware frameworks can be difficult (see RobotWebTools/ros2-web-bridge#134)


Machinetalk: full update message on connect (https://machinekoder.com/machinetalk-explained-part-4-hal-remote/)