@mildsunrise · Last active March 24, 2025 13:10
Documentation of Tuya's weird compression scheme for IR codes

Tuya's IR blasters, like the ZS08, have the ability to both learn and blast generic IR codes. These IR codes are given to the user as an opaque string, like this:

A/IEiwFAAwbJAfIE8gSLIAUBiwFAC+ADAwuLAfIE8gSLAckBRx9AB0ADBskB8gTyBIsgBQGLAUALA4sB8gRAB8ADBfIEiwHJAeARLwHJAeAFAwHyBOC5LwGLAeA97wOLAfIE4RcfBYsB8gTyBEAFAYsB4AcrCYsB8gTyBIsByQHgPY8DyQHyBOAHAwHyBEAX4BVfBIsB8gTJoAMF8gSLAckB4BUvAckB4AEDBfIEiwHJAQ==

Not much is known about the format of these IR code strings. This makes it difficult to use codes obtained through other means (such as a manual implementation of the IR protocol for a particular device, or public Internet code tables) with these blasters, and also to use codes learnt through these blasters with other brands of blasters, or to study their contents.

So far I've only been able to find one person who dug into this before me, who was able to understand it enough to create their own codes to blast, but not enough to understand codes learnt by the device.

This document attempts to fully document the format and also provides a (hopefully) working Python implementation.

Overview

There is no standard for IR codes, so appliances use different methods to encode the data into an IR signal, often called "IR protocols". A popular one, which could be considered an unofficial standard, is the NEC protocol. NEC specifies a way to encode 16 bits as a series of pulses of modulated IR light, but it's just one protocol.

Tuya's IR blasters are meant to be generic and work with just about any protocol. To do that, they work at a lower level and record the IR signal directly instead of detecting a particular protocol and decoding the bits. In particular, the blaster records a binary signal like this one:

  +------+     +----------+  +-+
  |      |     |          |  | |
--+      +-----+          +--+ +---

Such a signal can be represented by noting the times at which it flips from low to high and vice versa. It is more compact to record the differences between these times (the durations), since they are smaller numbers. For example, the above signal is represented as:

[7, 6, 11, 3, 2]

Meaning, the signal stays high for 7 units of time, then low for 6 units, then high for 11 units, and so on. The first duration always belongs to a high state, which means odd-numbered durations (1st, 3rd, 5th...) are always high periods while even-numbered durations (2nd, 4th, 6th...) are always low periods.
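
As a quick illustration (a sketch of mine, not part of the blaster firmware), going from edge timestamps to this duration list is just a matter of taking consecutive differences:

    # Hypothetical edge timestamps (arbitrary units) at which the signal above flips.
    edges = [0, 7, 13, 24, 27, 29]
    # The recorded durations are the differences between consecutive edges.
    durations = [b - a for a, b in zip(edges, edges[1:])]
    assert durations == [7, 6, 11, 3, 2]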

The blaster takes these numbers (in units of microseconds) and encodes each of them as a little-endian 16-bit integer, resulting in the following 10 bytes:

07 00 06 00 0B 00 03 00 02 00
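
This is the same packing done by the implementation further below; a quick way to check the byte layout yourself:

    from struct import pack
    payload = b''.join(pack('<H', t) for t in [7, 6, 11, 3, 2])
    assert payload.hex(' ') == '07 00 06 00 0b 00 03 00 02 00'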

Because we're recording a signal rather than high-level protocol data, this results in very long messages in real life. So, the blaster compresses these bytes using a weird algorithm (see below), and then encodes the resulting bytes using base64 so the user can copy/paste the code easily.

Compression scheme

Update: Turns out this is FastLZ compression. No need to read this section, you can go to their website instead.

I was unable to find a public algorithm that matched this, so I'm assuming it's a custom lossless compression algorithm that a random Tuya employee hacked together to make my life more complicated. Jokes aside, it seems to be doing a very poor job, and if I were them I would've just used Huffman coding or something.

Anyway, the algorithm is LZ77-based, with a fixed 8kB window. The stream contains a series of blocks. Each block begins with a "header byte", and the 3 MSBs of this byte determine the type of block:

  • If the 3 bits are zero, then this is a literal block and the other 5 bits specify a length L minus one.

    Upon encountering this block, the decoder consumes the next L bytes from the stream and emits them as output.

    +--------+-----------------------------+
    |000LLLLL| 1..32 bytes, depending on L |
    +--------+-----------------------------+
    
  • If the 3 bits have any other value, then this is a length-distance pair block; the 3 bits specify a length L minus 2, and the concatenation of the other 5 bits with the next byte specifies a distance D minus 1.

    Upon encountering this block, the decoder copies L bytes from the previous output. It begins copying D bytes before the output cursor, so if D = 1, the first copied byte is the most recently emitted byte; if D = 2, the byte before that one, and so on.

    As usual, it may happen that L > D, in which case the output repeats as necessary (for example if L = 5 and D = 2, and the 2 last emitted bytes are X and Y, the decoder would emit XYXYX).

    +--------+--------+
    |LLLDDDDD|DDDDDDDD|
    +--------+--------+
    

    As a special case, if the 3 bits are all ones (i.e. 7), then there's an extra byte between the header byte and the distance byte, which specifies a value to be added to L:

    +--------+--------+--------+
    |111DDDDD|LLLLLLLL|DDDDDDDD|
    +--------+--------+--------+
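
To make the two block types concrete, here's a small hand-made example (mine, not captured from a real blaster), using the `decompress` function from the Python implementation below. It consists of a literal block emitting ABC, followed by a length-distance pair with L = 5 and D = 2, which copies from the two most recent bytes and therefore repeats them:

    # Literal block: header 0x02 (000 00010, so L = 2 + 1 = 3), then 3 literal bytes.
    # Length-distance block: header 0x60 (011 00000, so L = 3 + 2 = 5),
    # distance byte 0x01 (so D = 0b00000_00000001 + 1 = 2).
    stream = bytes([0x02, 0x41, 0x42, 0x43, 0x60, 0x01])
    assert decompress(io.BytesIO(stream)) == b'ABCBCBCB'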
    
import io
import base64
from bisect import bisect
from struct import pack, unpack
# MAIN API
def decode_ir(code: str) -> list[int]:
    '''
    Decodes an IR code string from a Tuya blaster.
    Returns the IR signal as a list of µs durations,
    with the first duration belonging to a high state.
    '''
    payload = base64.decodebytes(code.encode('ascii'))
    payload = decompress(io.BytesIO(payload))
    signal = []
    while payload:
        assert len(payload) >= 2, \
            f'garbage in decompressed payload: {payload.hex()}'
        signal.append(unpack('<H', payload[:2])[0])
        payload = payload[2:]
    return signal

def encode_ir(signal: list[int], compression_level=2) -> str:
    '''
    Encodes an IR signal (see `decode_ir`)
    into an IR code string for a Tuya blaster.
    '''
    payload = b''.join(pack('<H', t) for t in signal)
    compress(out := io.BytesIO(), payload, compression_level)
    payload = out.getvalue()
    return base64.encodebytes(payload).decode('ascii').replace('\n', '')

# DECOMPRESSION
def decompress(inf: io.FileIO) -> bytes:
    '''
    Reads a "Tuya stream" from a binary file,
    and returns the decompressed byte string.
    '''
    out = bytearray()
    while (header := inf.read(1)):
        L, D = header[0] >> 5, header[0] & 0b11111
        if not L:
            # literal block
            L = D + 1
            data = inf.read(L)
            assert len(data) == L
        else:
            # length-distance pair block
            if L == 7:
                L += inf.read(1)[0]
            L += 2
            D = (D << 8 | inf.read(1)[0]) + 1
            assert len(out) >= D
            data = bytearray()
            while len(data) < L:
                data.extend(out[-D:][:L-len(data)])
        out.extend(data)
    return bytes(out)

# COMPRESSION
def emit_literal_blocks(out: io.FileIO, data: bytes):
    for i in range(0, len(data), 32):
        emit_literal_block(out, data[i:i+32])

def emit_literal_block(out: io.FileIO, data: bytes):
    length = len(data) - 1
    assert 0 <= length < (1 << 5)
    out.write(bytes([length]))
    out.write(data)

def emit_distance_block(out: io.FileIO, length: int, distance: int):
    distance -= 1
    assert 0 <= distance < (1 << 13)
    length -= 2
    assert length > 0
    block = bytearray()
    if length >= 7:
        assert length - 7 < (1 << 8)
        block.append(length - 7)
        length = 7
    block.insert(0, length << 5 | distance >> 8)
    block.append(distance & 0xFF)
    out.write(block)

def compress(out: io.FileIO, data: bytes, level=2):
    '''
    Takes a byte string and outputs a compressed "Tuya stream".

    Implemented compression levels:
    0 - copy over (no compression, 3.1% overhead)
    1 - eagerly use first length-distance pair found (linear)
    2 - eagerly use best length-distance pair found
    3 - optimal compression (n^3)
    '''
    if level == 0:
        return emit_literal_blocks(out, data)

    W = 2**13 # window size
    L = 255+9 # maximum length
    distance_candidates = lambda: range(1, min(pos, W) + 1)

    def find_length_for_distance(start: int) -> int:
        length = 0
        limit = min(L, len(data) - pos)
        while length < limit and data[pos + length] == data[start + length]:
            length += 1
        return length
    find_length_candidates = lambda: \
        ( (find_length_for_distance(pos - d), d) for d in distance_candidates() )
    find_length_cheap = lambda: \
        next((c for c in find_length_candidates() if c[0] >= 3), None)
    find_length_max = lambda: \
        max(find_length_candidates(), key=lambda c: (c[0], -c[1]), default=None)

    if level >= 2:
        suffixes = []; next_pos = 0
        key = lambda n: data[n:]
        find_idx = lambda n: bisect(suffixes, key(n), key=key)
        def distance_candidates():
            nonlocal next_pos
            while next_pos <= pos:
                if len(suffixes) == W:
                    suffixes.pop(find_idx(next_pos - W))
                suffixes.insert(idx := find_idx(next_pos), next_pos)
                next_pos += 1
            idxs = (idx+i for i in (+1,-1)) # try +1 first
            return (pos - suffixes[i] for i in idxs if 0 <= i < len(suffixes))

    if level <= 2:
        find_length = { 1: find_length_cheap, 2: find_length_max }[level]
        block_start = pos = 0
        while pos < len(data):
            if (c := find_length()) and c[0] >= 3:
                emit_literal_blocks(out, data[block_start:pos])
                emit_distance_block(out, c[0], c[1])
                pos += c[0]
                block_start = pos
            else:
                pos += 1
        emit_literal_blocks(out, data[block_start:pos])
        return

    # use topological sort to find shortest path
    predecessors = [(0, None, None)] + [None] * len(data)
    def put_edge(cost, length, distance):
        npos = pos + length
        cost += predecessors[pos][0]
        current = predecessors[npos]
        if not current or cost < current[0]:
            predecessors[npos] = cost, length, distance
    for pos in range(len(data)):
        if c := find_length_max():
            for l in range(3, c[0] + 1):
                put_edge(2 if l < 9 else 3, l, c[1])
        for l in range(1, min(32, len(data) - pos) + 1):
            put_edge(1 + l, l, 0)

    # reconstruct path, emit blocks
    blocks = []; pos = len(data)
    while pos > 0:
        _, length, distance = predecessors[pos]
        pos -= length
        blocks.append((pos, length, distance))
    for pos, length, distance in reversed(blocks):
        if not distance:
            emit_literal_block(out, data[pos:pos + length])
        else:
            emit_distance_block(out, length, distance)
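
A minimal round-trip check of the API above (with a synthetic signal of mine, not a real remote's code):

    # Encode a short synthetic signal and make sure decoding gives it back.
    signal = [9000, 4500, 560, 560, 560, 1690, 560]
    code = encode_ir(signal)
    assert decode_ir(code) == signal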
@burkminipup

Something that's worth noting: if you have any timings larger than 65,535, you will need to just set them to 65,535, since each timing is represented with 2 bytes. I had some like that at the end of the commands, but it worked fine just shortening them.

Thanks @Bazoogle, I ran into this issue on a new remote last night. Updated the code to clamp integers >65535 to exactly 65535 and it worked.
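
For anyone hitting the same limit, one way to apply that clamp when packing (a sketch of mine, assuming you adjust `encode_ir` accordingly; not necessarily the exact fix mentioned above):

    # Clamp each duration to the 16-bit maximum before packing.
    payload = b''.join(pack('<H', min(t, 0xFFFF)) for t in signal)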

@pasthev commented Feb 21, 2025

Hola Alba - just wanted to let you know that I've been asked about Tuya encoding in a Sensus IR/RF Converter discussion, and since I had no Tuya IR device to play with, I used your great analysis of this (weird) protocol to build an online Tuya codec application with Streamlit: https://irtuya.streamlit.app/.

Helped me to de-scramble some packets to the core command and explain it here, so, thanks a lot!

I might further develop this Streamlit code to do the whole decoding / encoding of the NEC commands, but since this app, at the moment, is basically only offering an HTTP interface for the equivalent of your code, let me know if you prefer me to remove it!

Pascal

@pasthev commented Feb 22, 2025

@magicus, maybe a bit late, but now that I'm looking at the Tuya encoding a few years after I coded Sensus, I see your past questions about your captured frames... AFAIK, the code repeats you see are simply due to too-long presses when learning a sequence - I had the same issue with Broadlink, which led me to build Sensus for IR and RF signal conversions in order to clean my recorded signals.

Good news is that the decoding of your sequences works like a charm: all you have to do is open Sensus IR & RF Code Converter, paste the sequence in the Raw field, then at the bottom of the page, under Raw analysis, click Read raw.

Example

9041, 4524, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 550, 550, 1722, 550, 1722, 550, 550, 550, 1722, 550, 602, 550, 550, 550, 550, 550, 1722, 550, 550, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 550, 550, 1722, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 1722, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 1722, 550, 40362, 9041, 4524, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 602, 550, 1722, 550, 1722, 550, 550, 550, 1722, 550, 550, 550, 550, 550, 602, 550, 1722, 550, 550, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 550, 550, 1722, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 1722, 550, 550, 550, 1722, 550, 1722, 550, 1722, 550, 1722, 550

Sensus result

> Nec encoding detected  with shift value: 0/1 *01110110 10001001 + 11010000 00101111* - hex: *** 7689 d02f *** - short: < 6e0b >

7689 d02f is the raw signal.
6e0b is the "real" NEC decoded instruction: command 0b addressed to peripheral 6e
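
For anyone verifying the arithmetic: assuming the usual NEC convention (bytes transmitted LSB-first, with the second and fourth bytes being complements of the first and third), the "short" form follows by bit-reversing the address and command bytes. A quick sanity check:

    rev = lambda b: int(f'{b:08b}'[::-1], 2)                  # reverse the 8 bits of a byte
    addr, addr_inv, cmd, cmd_inv = 0x76, 0x89, 0xD0, 0x2F
    assert addr_inv == addr ^ 0xFF and cmd_inv == cmd ^ 0xFF  # NEC redundancy holds
    assert (rev(addr), rev(cmd)) == (0x6E, 0x0B)              # i.e. the "short" 6e0b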

Also, if you simply paste your command in the Raw field and hit Convert, it will be converted to multiple formats, and the top row will show the command
7689 d02f + 7689 d02f, confirming that the exact same command is sent twice, despite the slight value differences you noted.
Now if you delete one of the repeated commands and hit "Convert", you'll get a cleaned sequence in the Raw field.

This is a late reply, but hope it will be useful to others. :)
Pascal

@Astro1247

I would like to express my deep appreciation to @mildsunrise for the published results of the work done. This information has helped me a lot in creating a custom integration of my old air conditioner into Home Assistant!
Without this gist it would have taken me much longer to create and understand it!

@magicus commented Feb 25, 2025

@pasthev Oooh, that is really interesting and nice! And no, it was not too late, I have not really had enough time to spend to make any progress in this area.

My plan (or maybe more realistically, "dream") is to turn all this information into an easy-to-use Home Assistant custom component (or possibly add-on) that can interact with these Tuya devices. At this point, it is basically about combining mildsunrise's decoding logic with your Sensus functionality, and integrating it with the Home Assistant ecosystem. (Which could be challenging enough...) But at least I see a clear way forward, and it is "just" about spending enough time on it.

@pasthev commented Feb 26, 2025

@magicus glad I could help !

I am nearly done with an online conversion tool that does the whole thing; from Tuya to Raw to NEC signal to the actual device "short" command, same URL as before: IRTuya

App can already do the whole thing one way, from Tuya to short NEC command. I realize not many people have a use for the "short" commands, but I like to be able to get them, not only because these are the "real" commands being sent, but also because manufacturers sometimes share them in their user manuals - that's the case for my old Pioneer amplifier.
Here's an example of what these look like for a very common Samsung TV remote:

NAME            Samsung_BN59-00940A
KEY_POWER       0x40BF
KEY_TV          0xD827
KEY_1           0x20DF

...and the super important thing is that you can sometimes get discrete manufacturer codes this way - i.e., separate Power On and Power Off short codes despite only having a single On/Off toggle key on the remote control - this is priceless for automation. Here's a full LIRC file listing all the commands for this same RC.

IRTuya is already able to properly identify and decode both NEC standard & NEC Extended packets - I don't think I'll try to add more protocols, as Sensus is already programmed to try some tricky dirty stuff in order to decode non-standard signals, and it's been a pain to program.

I also implemented a visual representation of the IR signal with time scale in IRTuya - might be useful for a quick overview of unidentified signals.

I need to refine a few details, like implementing the conversion the other way around (from NEC to Tuya), then I'll share the source on GitHub - hope you'll find some useful stuff in this code for your add-on.
Pascal

@pasthev commented Mar 3, 2025

@mildsunrise

Alba, while re-using your code (thanks a lot for this), I realized the decoding could lead to an infinite loop when the provided string, assumed to be in B64, contains random, non-FastLZ data.
i.e. "2222222222222222222222222222" is a valid B64 string, since it matches re.match(r'^[A-Za-z0-9+/]*={0,2}$', ...) and its length is a multiple of 4.

FastLZ doesn't seem to have many sanity-check options, but the following additions to the code, although not 100% secure, have been enough to reduce the infinite loop occurrences to 0 in my case:

def _might_be_fastlz_level1(data: bytes) -> bool:
    """
    Checks if the data *might* be in FastLZ Level 1 compressed format.
    Heuristic check based on the first byte, FastLZ Level 1 spec.
    FastLZ documentation: https://github.com/ariya/FastLZ
    """

    if not data or len(data) < 2:
        # FastLZ Level 1 block needs at least 2 bytes: header + data
        return False

    first_byte = data[0]

    # Extract block tag (3 most significant bits - MSB)
    block_tag = first_byte >> 5

    # Check if block tag is 0 for Level 1
    if block_tag != 0b000: # Block tag for Level 1 must be 0 (binary '000')
        return False

    # Extract literal run size (5 least significant bits - LSB) and add 1
    literal_run_size = (first_byte & 0b00011111) + 1

    if len(data) < literal_run_size:
        # Data length is shorter than the declared literal run size
        # Not enough bytes to contain the literal run.
        return False

    return True
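
For what it's worth, a quick check of mine that this heuristic does reject the pathological input mentioned above:

    # The all-'2' base64 string decodes to bytes starting with 0xDB,
    # whose top 3 bits are 110 (not a literal block), so it is rejected.
    assert not _might_be_fastlz_level1(base64.b64decode('2222222222222222222222222222'))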


(inside decode_ir:)

    # B64 check
    if re.match(r'^[A-Za-z0-9+/]*={0,2}$', code):
        try:
            decoded_bytes = base64.b64decode(code)
            if _might_be_fastlz_level1(decoded_bytes):
                payload = decompress(io.BytesIO(decoded_bytes))
                # ... (rest of decode_ir continues here)
            else:
                pass
        except Exception:
            pass

Pascal

@mildsunrise (Author)

Thanks @pasthev! I'm away from my computer but the only thing I can see that would cause an infinite loop when decoding is if the very first block is a length-distance pair one... I've just added an assert to reject invalid bytestreams referring to data that isn't there, which should also prevent the infinite loop :)
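
A quick check of mine against the updated code above, showing the new assert firing instead of looping:

    # A stream that starts with a length-distance block referencing data
    # that doesn't exist yet now fails fast instead of looping forever.
    try:
        decompress(io.BytesIO(bytes([0x60, 0x01])))
    except AssertionError:
        pass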
