@Paraphraser · Created November 29, 2022 05:36
IOTstack tutorial: InfluxDB data merging


Scenario

A question on Discord described a situation where a new (empty) database had been created. That database contained data acquired since the rebuild but had no historical data from before the rebuild. However, the person did have a "portable" backup of the older data and wanted to know whether it was possible to merge the backup with the live data.

The answer is "yes". Merging measurements will solve this particular problem.

My test environment

I wanted to construct a scenario that was similar to the one described on Discord. I have two Raspberry Pis:

  • iot-hub is my "live" Pi (8GB Pi 4); and
  • sec-dev is my "test" Pi (4GB Pi 4).

For the record, both are full 64-bit Bullseye built with PiBuilder. InfluxDB is pulled via docker-compose as "influxdb:1.8".

iot-hub runs iotstack_backup twice a day. By restoring the most-recent backup from iot-hub onto sec-dev, I can get an experimental testbed with reasonably up-to-date data.

When my IoT devices transmit MQTT payloads, they are sent to the Mosquitto instance running on iot-hub. In my Node-RED flows, the MQTT-in nodes reference the hostname mosquitto, which resolves to the container running on the same machine.

In other words, although Node-RED flows running on sec-dev connect to the Mosquitto instance running on sec-dev, the Mosquitto instance running on sec-dev never actually receives any data from IoT devices. The Node-RED flows don't trigger and nothing gets added to the InfluxDB databases on sec-dev.

However, if I change a Node-RED flow on sec-dev so that it subscribes to the Mosquitto instance running on iot-hub, the flow starts to receive live data, and then the flow will add that data to the Influx database on sec-dev. I'm going to exploit that behaviour for this tutorial.

Note:

  • This tutorial makes use of the following alias:

     alias influx='docker exec -it influxdb influx -precision=rfc3339'
    

My InfluxDB instance has a number of databases:

$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10

> show databases
name: databases
name
----
power
weather
computer
battery
mailbox
trigboard
airquality

The database I'm going to use in this tutorial is named "power":

> use power
Using database power

The database has two measurements:

> show measurements
name: measurements
name
----
hiking
solax
  • "hiking": a Class 1 Electricity Meter made by Hiking where an ESP32-based "Arduino" system is polling every 10 seconds and transmitting a payload over MQTT.
  • "solax": an ESP8266-based "Arduino" which waits for 5-minute updates from a SolaX inverter and relays the data over MQTT.

I'm going to focus on the "hiking" measurement:

> show field keys from hiking
name: hiking
fieldKey     fieldType
--------     ---------
current      float
exportWatts  float
frequency    float
importEnergy float
importWatts  float
meterErrors  float
powerFactor  float
voltage      float
watts        float

> show tag keys from hiking
>

I only have the one Hiking electricity meter so the payload is 100% "variable data". There is no practical need for any tag keys to describe the meter's metadata. I'm mentioning this because I'll come back to the question of tag keys a bit later.
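For contrast, here's a sketch in InfluxDB line protocol of what the writes might look like if there were several meters and a hypothetical "meter" tag key were used to tell them apart (the tag name and values are illustrative, not from my actual setup):

```
# single meter, fields only - how this tutorial's data is written:
hiking current=1.97,voltage=248.3,watts=287

# hypothetical multi-meter setup - "meter" becomes a tag key:
hiking,meter=garage current=1.97,voltage=248.3,watts=287
hiking,meter=shed current=0.41,voltage=247.9,watts=95
```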

Test construction

The starting position is to restore the most-recent iot-hub backup onto sec-dev:

$ iotstack_restore 2022-11-28_2338.iot-hub
----- Starting iotstack_restore at Tue 29 Nov 2022 09:33:28 AEDT -----
…

The runtag argument shows when the backup was triggered on iot-hub. The script does some other work before telling the InfluxDB container to create a portable backup. The "hiking" measurement gains a new row every 10 seconds so there should be some data between 23:38 and when the portable backup was actually performed by the backup script:

$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10

> use power
Using database power

> select * from hiking where time > '2022-11-28T23:38:00.000+11:00' tz('Australia/Sydney')
name: hiking
time                                current exportWatts frequency importWatts powerFactor voltage watts
----                                ------- ----------- --------- ----------- ----------- ------- -----
2022-11-28T23:38:00.53690412+11:00  1.99    22636.81    49.93     18841.09    0.6         247.4   295
2022-11-28T23:38:10.514954399+11:00 1.96    22636.81    49.97     18841.09    0.6         247.4   290
2022-11-28T23:38:20.516996631+11:00 1.97    22636.81    49.98     18841.09    0.6         247.1   290
2022-11-28T23:38:30.520111805+11:00 1.96    22636.81    49.99     18841.09    0.6         247.6   298
2022-11-28T23:38:40.570951035+11:00 1.97    22636.81    49.95     18841.09    0.6         247.5   299
2022-11-28T23:38:50.522897373+11:00 1.97    22636.81    49.96     18841.09    0.6         248.1   289
2022-11-28T23:39:00.531880988+11:00 1.97    22636.81    49.94     18841.09    0.59        247.8   287
2022-11-28T23:39:10.537390914+11:00 1.97    22636.81    49.97     18841.09    0.6         248.3   287

Exactly as expected. The last row is at timestamp 2022-11-28T23:39:10.537390914+11:00. This row will mark the boundary between "old" and "new" data once we get a bit further on.

Now, let's get rid of all of the Hiking data:

> drop measurement hiking

Prove that that worked:

> select * from hiking
> 

At this point, I changed the relevant Node-RED flow so that its MQTT-in node subscribed to the "live" instance of Mosquitto running on iot-hub. The flow starts to receive data and it gets added to the database on sec-dev:

> select * from hiking tz('Australia/Sydney')
name: hiking
time                                current exportWatts frequency importWatts powerFactor voltage watts
----                                ------- ----------- --------- ----------- ----------- ------- -----
2022-11-29T09:45:35.920157551+11:00 1.4     22636.86    50.01     18843.23    0.04        246.6   -42
2022-11-29T09:45:45.926697028+11:00 1.79    22636.86    49.97     18843.23    0.7         246.5   432
2022-11-29T09:45:55.928528623+11:00 1.22    22636.86    49.98     18843.23    0.04        247.4   12
2022-11-29T09:46:05.919267521+11:00 1.21    22636.86    49.96     18843.23    0.25        246.5   54
…

That's all "new" data. All the old data disappeared with the drop measurement. The measurement will continue to gain rows while the merging operation proceeds to restore the "old" data.

Stocktake: the state of play

The state of play can be summarised like this:

  1. After the drop measurement, we have the equivalent of a newly-created, empty database†.

  2. Changing the Node-RED flow to receive live data caused the newly-created database† to start gaining new rows.

  3. The external ~/IOTstack/backups/influxdb/db path has the most-recent portable backup which contains all historical data up to and including the 2022-11-28T23:39:10.537390914+11:00 timestamp marking the boundary between "old" and "new" data.

    † I'm using "database" loosely, in the sense of the original question on Discord. Strictly speaking, "power" is the database and "hiking" is the measurement (what would be called a table in SQL).

This would seem to be a reasonable approximation of the situation described on Discord.

Repair by merging

Our task is to merge what is in the portable backup (the "old" data) with what is in the live measurement (the "new" data).

The first step is to restore the "power" database from the portable backup but to give it a new name ("oldpower"):

$ docker exec influxdb influxd restore -portable -db power -newdb oldpower /var/lib/influxdb/backup
2022/11/29 09:49:29 Restoring shard 17 live from backup 20221128T123819Z.s17.tar.gz
2022/11/29 09:49:29 Restoring shard 114 live from backup 20221128T123819Z.s114.tar.gz
…

When we go into the CLI, we see an extra database:

$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10

> show databases
name: databases
name
----
power
weather
computer
battery
mailbox
trigboard
airquality
oldpower

The "oldpower" database contains the expected measurements:

> use oldpower
Using database oldpower
> show measurements
name: measurements
name
----
hiking
solax

A query shows that it has the expected data rows at the tail end:

> select * from hiking where time > '2022-11-28T23:38:00.000+11:00' tz('Australia/Sydney')
name: hiking
time                                current exportWatts frequency importWatts powerFactor voltage watts
----                                ------- ----------- --------- ----------- ----------- ------- -----
2022-11-28T23:38:00.53690412+11:00  1.99    22636.81    49.93     18841.09    0.6         247.4   295
2022-11-28T23:38:10.514954399+11:00 1.96    22636.81    49.97     18841.09    0.6         247.4   290
2022-11-28T23:38:20.516996631+11:00 1.97    22636.81    49.98     18841.09    0.6         247.1   290
2022-11-28T23:38:30.520111805+11:00 1.96    22636.81    49.99     18841.09    0.6         247.6   298
2022-11-28T23:38:40.570951035+11:00 1.97    22636.81    49.95     18841.09    0.6         247.5   299
2022-11-28T23:38:50.522897373+11:00 1.97    22636.81    49.96     18841.09    0.6         248.1   289
2022-11-28T23:39:00.531880988+11:00 1.97    22636.81    49.94     18841.09    0.59        247.8   287
2022-11-28T23:39:10.537390914+11:00 1.97    22636.81    49.97     18841.09    0.6         248.3   287

Now we'll merge the two. This is the command I'm going to use:

SELECT * INTO power..hiking FROM hiking GROUP BY *

Breaking that down into its components:

  • SELECT * means "select all fields and tags".
  • INTO power..hiking refers to the target of the merge: the database (power) and measurement (hiking) that is actively ingesting "new" data from Node-RED.
  • FROM hiking means the hiking measurement in the working database. We set the working database earlier with the use oldpower command so this reference to hiking implies the "old" data.
  • GROUP BY * preserves tag keys (if any). If you omit this clause, all your tag keys will be converted to field keys. It's important to be aware of the need for this clause if you use tag keys.
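To make the GROUP BY * point concrete, suppose the source measurement carried a hypothetical "meter" tag (mine doesn't). The two forms below sketch the difference:

```
-- tag keys in the source survive as tag keys in the target:
SELECT * INTO power..hiking FROM hiking GROUP BY *

-- without the clause, "meter" would be rewritten as a field key in the target:
SELECT * INTO power..hiking FROM hiking
```

The second form silently changes the schema of the target measurement, which is usually not what you want.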

Let's run it in anger. InfluxQL follows the SQL convention of being case-insensitive so don't worry about the change in case (I'm just being lazy):

> select * into power..hiking from hiking group by *
name: result
time                 written
----                 -------
1970-01-01T00:00:00Z 14535162

Doing that merge took a very long time (I didn't time it but I guess about 20 minutes). It consumed around 120% of CPU, took RAM up to 85%, gave the virtual memory swap space a fair hammering (it's all on an SSD), and sent the system temperature over 60°C. That kind of performance penalty is to be expected when you're merging 14.5 million rows!
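With hindsight, it might have been gentler on the Pi to do the merge in time-bounded chunks rather than one 14.5-million-row pass. The following sketch (untested against my data; the date range and one-month chunk size are illustrative) emits one SELECT … INTO statement per chunk:

```shell
#!/usr/bin/env bash
# Emit one time-bounded merge statement per month, so InfluxDB never has to
# process the whole measurement in a single SELECT ... INTO pass.
emit_chunked_merge() {
    local cursor="$1" end="$2" next
    while [ "$(date -u -d "$cursor" +%s)" -lt "$(date -u -d "$end" +%s)" ]; do
        next="$(date -u -d "$cursor + 1 month" +%Y-%m-%dT%H:%M:%SZ)"
        echo "SELECT * INTO power..hiking FROM hiking WHERE time >= '$cursor' AND time < '$next' GROUP BY *"
        cursor="$next"
    done
}

# three one-month chunks covering September through November 2022
emit_chunked_merge "2022-09-01T00:00:00Z" "2022-12-01T00:00:00Z"
```

Each emitted statement could then be run one at a time via docker exec influxdb influx -database oldpower -execute '…', pausing between chunks to let the Pi catch its breath.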

The wash-up

And then the Pi crashed:

> packet_write_wait: Connection to 192.168.132.102 port 22: Broken pipe
[macOS] $ ping sec-dev
--- sec-dev.my.domain.com ping statistics ---
…
3 packets transmitted, 0 packets received, 100.0% packet loss

But, as Douglas Adams would say, "Don't Panic".

I had to power-cycle the Pi. When the Pi came back, it took a long time (minutes) before the Influx CLI was open for business. My guess is it was processing journals to complete any transactions that were still pending at the point of the crash.

So, was any harm done? Let's see. How many rows were in the old measurement?

$ influx
> use oldpower
> select count(voltage) from hiking
name: hiking
time                 count
----                 -----
1970-01-01T00:00:00Z 14535162

Same number as the "written" above. I know the timestamp of the last row from the old data set. How many rows are in the new database at or before that timestamp?

> use power
Using database power
> select count(voltage) from hiking where time <= '2022-11-28T23:39:10.537390914+11:00'
name: hiking
time                 count
----                 -----
1970-01-01T00:00:00Z 14535162

All data present and correct! Influx is pretty darn good. You just have to be patient with it.

This is the point where I did a copy-and-paste of the terminal session and started using it as the basis for this gist. It's now roughly three and a half hours later. How many rows in the merged database now?

> select count(voltage) from hiking
name: hiking
time                 count
----                 -----
1970-01-01T00:00:00Z 14536395

The difference is 1233 rows. At 10 seconds per row, that's about 3.4 hours' worth of data, which squares with the elapsed time, making all due allowance for the Pi crashing in the middle.

Up, up and away!

Finishing up

We still have the side-loaded database containing the old data. Always clean up as you go:

> drop database oldpower
> exit
$

Job done!

Lessons

Fourteen-and-a-half million rows is a lot of power data. I really do need to get around to creating retention policies and continuous queries to resample the older data into something more manageable.
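As a sketch of what that might look like (the policy names, durations and the 5-minute resample interval are all illustrative, not something I've deployed):

```
-- keep raw data for 8 weeks, downsampled data for 5 years
CREATE RETENTION POLICY "raw_8w" ON "power" DURATION 8w REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "downsampled_5y" ON "power" DURATION 260w REPLICATION 1

-- resample each field's mean into 5-minute buckets as data arrives
CREATE CONTINUOUS QUERY "cq_hiking_5m" ON "power"
BEGIN
  SELECT mean(*) INTO "power"."downsampled_5y"."hiking_5m"
  FROM "power"."raw_8w"."hiking" GROUP BY time(5m)
END
```

Note that marking "raw_8w" as DEFAULT changes where new writes land, so anything writing to the database (like my Node-RED flows) would need to be checked against that.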

Demain, domani, mañana, …
