A fairly common question from IOTstack users running InfluxDB is how to remove "bad data" from a database?
This gist focuses on InfluxDB 1.8 and the InfluxQL commands needed to operate upon an InfluxDB 1.8 database. You are entirely on your own if you are using InfluxDB 2.x.
The best advice I can give you is to avoid the problem by ensuring that "bad data" never gets into your database in the first place. For example:
-
if you are coding a device such as an ESP32 plus a temperature sensor, decide the valid range of temperatures for your location and only send measurements within that range. Use a separate MQTT topic to signal error conditions such as:
- sensor not responding
- sensor API returning nulls or NaNs
- sensor value out of expected range
-
if you are using an off-the-shelf device where you don't control the coding, check for nulls, NANs or out-of-range conditions in your Node-RED flow. Don't just blindly feed MQTT payloads into Influx.
Of course, the best laid plans of coders and database designers often go awry and you are left with a database containing garbage. The biggest problem InfluxDB users face is that InfluxQL, despite some superficial appearances to the contrary, is not SQL. With SQL, deleting erroneous rows (tuples) from a database is trivial. Using InfluxQL to delete erroneous observations is not quite as simple.
The basic problem is that you can't use a WHERE clause to select records for deletion based on the values of a field (which is a fairly common requirement).
This gist covers the following use-cases:
- delete one or a small number of observations;
- delete an arbitrarily large number of observations;
- delete a range of observations bounded by time; and
- delete all the observations associated with a given sensor.
If you built your IOTstack system using PiBuilder then IOTstackAliases will be installed. Alternatively, you can install IOTstackAliases yourself. Either way, you will have access to the influx
alias.
The third option is to define the influx
alias yourself, like this:
$ alias influx='docker exec -it influxdb influx -precision=rfc3339'
In words, the influx
alias executes (docker exec
) the influx
command inside the influxdb
container, setting up the session as an interactive terminal (the -it
flags) and passing the -precision=rfc3339
argument to the influx
command. The -precision
argument causes the influx
command to display timestamps in human-readable form rather than in integer nanoseconds UTC since the Unix epoch.
When you invoke the influx
alias, the >
prompt indicates that you are talking to the influx
command line utility (CLI) inside the InfluxDB container:
$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10
>
To leave the influx
CLI, either press control+d or type exit
and press return:
> exit
$
It is always a good idea to take a backup of your databases before you start deleting records. One way to do that is to use IOTstackBackup. That's another thing that will have been installed automatically if you constructed your IOTstack system using PiBuilder.
To backup your InfluxDB databases:
-
The
influxdb
container must be running: -
Execute:
$ iotstack_backup_influxdb snapshot.tar
If you need to restore your databases from the snapshot:
-
The
influxdb
container should not be running. It's a good idea to take down the whole stack so other services like Node-RED don't get confused. -
Execute:
$ iotstack_restore_influxdb snapshot.tar
-
Bring up the container (or the entire stack).
-
Wait until
docker ps
says the InfluxDB container has been running for 30 seconds.
I have a number of temperature sensors scattered around the house and outside in the garden. Let's suppose I've decided that any temperature above 43°C (~109°F) is erroneous and needs to be deleted from my database.
The first step is to identify the problematic rows. To do that I need to connect to the influx
CLI:
$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10
>
Then I need to specify the database I want to work on:
> USE weather
Using database weather
>
Tip:
- If you can't remember your database names, run the
SHOW DATABASES
command.
My weather database has several measurements:
> SHOW MEASUREMENTS
name: measurements
name
----
humidity
pressure
rainfall
temperature
wind
>
Each measurement has at least one series, tagged according to the sensor that produced the observations:
> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
temperature,device=weatherstation
wind,device=weatherstation
Now, let's query the temperature
measurement to see how many observations exceed 43°C:
> SELECT * FROM temperature WHERE temp > 43.0
name: temperature
time device temp
---- ------ ----
2021-01-12T05:40:23.559695388Z weatherstation 44.1
2021-01-12T05:46:26.870682618Z weatherstation 43.9
2021-01-23T05:21:40.495749257Z weatherstation 43.3
2021-01-23T05:27:43.72322541Z weatherstation 43.2
2021-01-23T06:16:18.582892504Z weatherstation 43.2
2021-01-23T06:22:21.871653732Z weatherstation 43.6
2023-02-17T06:17:53.175654391Z weatherstation 43.2
Notes:
- It is a convention to present keywords like "SELECT" and "FROM" in upper-case. However, both SQL and InfluxQL are case-insensitive for keywords so please don't feel the need to capitalise everything unless you want to.
- The "weatherstation" device is a Misol weather station, albeit with the electronics replaced with a Whisper Node. This device is outside in full sun (New South Wales, Australia) so it makes sense that it reports temperatures in this range.
The result of the SELECT
statement above is seven hits. That's actually a fairly easy number to deal with, one by one, but I'm only going to remove the first. I'll save the others for the next use-case.
To remove that first entry, you must use its timestamp, like this:
> DELETE FROM temperature WHERE time = '2021-01-12T05:40:23.559695388Z'
InfluxDB stores records using nanoseconds since the Unix epoch as the unique key. Providing you can specify the unique key exactly, you can remove the associated record.
Note:
- timestamps must be surrounded with single quote marks. You will get an error if you use double quotes.
Let's re-run the query to prove that the first hit has been removed:
> SELECT * FROM temperature WHERE temp > 43.0
name: temperature
time device temp
---- ------ ----
2021-01-12T05:46:26.870682618Z weatherstation 43.9
2021-01-23T05:21:40.495749257Z weatherstation 43.3
2021-01-23T05:27:43.72322541Z weatherstation 43.2
2021-01-23T06:16:18.582892504Z weatherstation 43.2
2021-01-23T06:22:21.871653732Z weatherstation 43.6
2023-02-17T06:17:53.175654391Z weatherstation 43.2
>
This one is slightly trickier because it involves the wholesale copying of tables - twice.
You should not attempt this on a "live" database. The result is undefined if Node-RED or some other data-writer is actively committing new observations while you are running these commands. You would be well advised to shut down those other data-writers first.
The steps:
-
Copy your existing table to a temporary intermediate table:
> SELECT * INTO cleaned FROM temperature WHERE temp <= 43.0 GROUP BY *
Notes:
- The
WHERE
clause specifies the observations you want to retain. It is the inverse of the earlier query selecting the records you want to delete. - The
GROUP BY *
clause is critical. If you omit it, you will lose your tags and all your series will be coalesced into a single series. - The time this command takes to execute depends on the size of your tables. Be patient!
- The
-
Remove the original table:
> DROP MEASUREMENT temperature
This removes the measurement and all associated series, including the records you want to delete.
-
Copy the contents of the temporary table into a table with the same name as your original table:
> SELECT * INTO temperature FROM cleaned GROUP BY *
Note:
- Once again, the
GROUP BY *
clause is critical. If you omit it, you will lose your tags and all your series will be coalesced into a single series.
- Once again, the
-
Remove the temporary table:
> DROP MEASUREMENT cleaned
-
Confirm that the measurements you wanted to remove have disappeared:
> SELECT * FROM temperature WHERE temp > 43.0 >
Silence is an empty result set.
Let's suppose all the data supplied by the weather sensor on Jan 1st 2024 was suspect. Providing you can safely set bounds by time then you can use a range.
First, how many observations were there in total for all of that date?
> SELECT count(temp) FROM temperature WHERE time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time count
---- -----
2024-01-01T00:00:00+11:00 1475
And how many of those were supplied by the weatherstation sensor?
SELECT count(temp) FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time count
---- -----
2024-01-01T00:00:00+11:00 240
The reason for doing both queries is to remind you that it is important to be precise. If we just deleted by time range, we would wind up clobbering all 1,475 records from all sensors rather than just the 240 records from the problem sensor.
I have specified the time ranges in my local time (UTC+11). The tz()
function tells the InfluxCLI to display the results in my local time. I do it this way so I can be sure that the time-range I'm targeting is what I think of as January 1st. The usefulness of the tz()
function becomes more apparent when you list actual records rather than counts.
The restriction on using a WHERE clause to select records for deletion only applies to fields, not tags. All by itself, that's a very good reason for employing tags for your invariant metadata, reserving fields for your data.
Now that we can be precise, we can delete by time-range:
DELETE FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00'
Prove that the records have gone:
> SELECT count(temp) FROM temperature WHERE time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time count
---- -----
2024-01-01T00:00:00+11:00 1235
> SELECT count(temp) FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
>
Again, silence on the second query indicates an empty result set. The difference between the counts of 1,475 and 1,235 is 240. which is the number of records we expected to delete.
Commissioning a new sensor sometimes trigger's Murphy's Law. Let's assume all the temperature data getting into your database from the weatherstation sensor is garbage and you want to remove it and start over.
Providing you associated each of your sensors with its own unique tag, it's easy to delete the entire series using that tag.
The series before we begin:
> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
temperature,device=weatherstation ««« to be deleted
wind,device=weatherstation
Let's assume the weatherstation has been providing temperature values in Fahrenheit rather than Celsius (or vice versa). You have fixed the sensor but now you have data with mixed units. Although you could try to convert the values on-the-fly, it is probably easier to start over:
> DROP SERIES FROM temperature where device = 'weatherstation'
Prove the series has been removed:
> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
wind,device=weatherstation
The temperature,device=weatherstation
series has disappeared. The other series from the weatherstation (rainfall and wind speed) have been left intact, as have the temperature series from other sensors.
As soon as the weatherstation sensor posts a new temperature observation, the temperature,device=weatherstation
series will be recreated automatically.