Skip to content

Instantly share code, notes, and snippets.

@Paraphraser
Last active February 16, 2024 22:30
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Paraphraser/3b3aa4c78bfad71789bdd1c9ba81365f to your computer and use it in GitHub Desktop.
Save Paraphraser/3b3aa4c78bfad71789bdd1c9ba81365f to your computer and use it in GitHub Desktop.
IOTstack tutorial: removing bad data from InfluxDB

IOTstack tutorial: removing bad data from InfluxDB

A fairly common question from IOTstack users running InfluxDB is how to remove "bad data" from a database?

This gist focuses on InfluxDB 1.8 and the InfluxQL commands needed to operate upon an InfluxDB 1.8 database. You are entirely on your own if you are using InfluxDB 2.x.

prevention is better than cure

The best advice I can give you is to avoid the problem by ensuring that "bad data" never gets into your database in the first place. For example:

  • if you are coding a device such as an ESP32 plus a temperature sensor, decide the valid range of temperatures for your location and only send measurements within that range. Use a separate MQTT topic to signal error conditions such as:

    • sensor not responding
    • sensor API returning nulls or NaNs
    • sensor value out of expected range
  • if you are using an off-the-shelf device where you don't control the coding, check for nulls, NANs or out-of-range conditions in your Node-RED flow. Don't just blindly feed MQTT payloads into Influx.

when you have no option

Of course, the best laid plans of coders and database designers often go awry and you are left with a database containing garbage. The biggest problem InfluxDB users face is that InfluxQL, despite some superficial appearances to the contrary, is not SQL. With SQL, deleting erroneous rows (tuples) from a database is trivial. Using InfluxQL to delete erroneous observations is not quite as simple.

The basic problem is that you can't use a WHERE clause to select records for deletion based on the values of a field (which is a fairly common requirement).

This gist covers the following use-cases:

  • delete one or a small number of observations;
  • delete an arbitrarily large number of observations;
  • delete a range of observations bounded by time; and
  • delete all the observations associated with a given sensor.

accessing your InfluxDB databases

If you built your IOTstack system using PiBuilder then IOTstackAliases will be installed. Alternatively, you can install IOTstackAliases yourself. Either way, you will have access to the influx alias.

The third option is to define the influx alias yourself, like this:

$ alias influx='docker exec -it influxdb influx -precision=rfc3339'

In words, the influx alias executes (docker exec) the influx command inside the influxdb container, setting up the session as an interactive terminal (the -it flags) and passing the -precision=rfc3339 argument to the influx command. The -precision argument causes the influx command to display timestamps in human-readable form rather than in integer nanoseconds UTC since the Unix epoch.

When you invoke the influx alias, the > prompt indicates that you are talking to the influx command line utility (CLI) inside the InfluxDB container:

$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10
>

To leave the influx CLI, either press control+d or type exit and press return:

> exit
$ 

take a backup before you begin

It is always a good idea to take a backup of your databases before you start deleting records. One way to do that is to use IOTstackBackup. That's another thing that will have been installed automatically if you constructed your IOTstack system using PiBuilder.

To backup your InfluxDB databases:

  1. The influxdb container must be running:

  2. Execute:

    $ iotstack_backup_influxdb snapshot.tar

If you need to restore your databases from the snapshot:

  1. The influxdb container should not be running. It's a good idea to take down the whole stack so other services like Node-RED don't get confused.

  2. Execute:

    $ iotstack_restore_influxdb snapshot.tar
  3. Bring up the container (or the entire stack).

  4. Wait until docker ps says the InfluxDB container has been running for 30 seconds.

scenario

I have a number of temperature sensors scattered around the house and outside in the garden. Let's suppose I've decided that any temperature above 43°C (~109°F) is erroneous and needs to be deleted from my database.

The first step is to identify the problematic rows. To do that I need to connect to the influx CLI:

$ influx
Connected to http://localhost:8086 version 1.8.10
InfluxDB shell version: 1.8.10
>

Then I need to specify the database I want to work on:

> USE weather
Using database weather
>

Tip:

  • If you can't remember your database names, run the SHOW DATABASES command.

My weather database has several measurements:

> SHOW MEASUREMENTS
name: measurements
name
----
humidity
pressure
rainfall
temperature
wind
> 

Each measurement has at least one series, tagged according to the sensor that produced the observations:

> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
temperature,device=weatherstation
wind,device=weatherstation

Now, let's query the temperature measurement to see how many observations exceed 43°C:

> SELECT * FROM temperature WHERE temp > 43.0
name: temperature
time                           device         temp
----                           ------         ----
2021-01-12T05:40:23.559695388Z weatherstation 44.1
2021-01-12T05:46:26.870682618Z weatherstation 43.9
2021-01-23T05:21:40.495749257Z weatherstation 43.3
2021-01-23T05:27:43.72322541Z  weatherstation 43.2
2021-01-23T06:16:18.582892504Z weatherstation 43.2
2021-01-23T06:22:21.871653732Z weatherstation 43.6
2023-02-17T06:17:53.175654391Z weatherstation 43.2

Notes:

  • It is a convention to present keywords like "SELECT" and "FROM" in upper-case. However, both SQL and InfluxQL are case-insensitive for keywords so please don't feel the need to capitalise everything unless you want to.
  • The "weatherstation" device is a Misol weather station, albeit with the electronics replaced with a Whisper Node. This device is outside in full sun (New South Wales, Australia) so it makes sense that it reports temperatures in this range.

use-case: delete a single observation

The result of the SELECT statement above is seven hits. That's actually a fairly easy number to deal with, one by one, but I'm only going to remove the first. I'll save the others for the next use-case.

To remove that first entry, you must use its timestamp, like this:

> DELETE FROM temperature WHERE time = '2021-01-12T05:40:23.559695388Z'

InfluxDB stores records using nanoseconds since the Unix epoch as the unique key. Providing you can specify the unique key exactly, you can remove the associated record.

Note:

  • timestamps must be surrounded with single quote marks. You will get an error if you use double quotes.

Let's re-run the query to prove that the first hit has been removed:

> SELECT * FROM temperature WHERE temp > 43.0
name: temperature
time                           device         temp
----                           ------         ----
2021-01-12T05:46:26.870682618Z weatherstation 43.9
2021-01-23T05:21:40.495749257Z weatherstation 43.3
2021-01-23T05:27:43.72322541Z  weatherstation 43.2
2021-01-23T06:16:18.582892504Z weatherstation 43.2
2021-01-23T06:22:21.871653732Z weatherstation 43.6
2023-02-17T06:17:53.175654391Z weatherstation 43.2
> 

use-case: delete a larger number of observations

This one is slightly trickier because it involves the wholesale copying of tables - twice.

You should not attempt this on a "live" database. The result is undefined if Node-RED or some other data-writer is actively committing new observations while you are running these commands. You would be well advised to shut down those other data-writers first.

The steps:

  1. Copy your existing table to a temporary intermediate table:

    > SELECT * INTO cleaned FROM temperature WHERE temp <= 43.0 GROUP BY *

    Notes:

    • The WHERE clause specifies the observations you want to retain. It is the inverse of the earlier query selecting the records you want to delete.
    • The GROUP BY * clause is critical. If you omit it, you will lose your tags and all your series will be coalesced into a single series.
    • The time this command takes to execute depends on the size of your tables. Be patient!
  2. Remove the original table:

    > DROP MEASUREMENT temperature

    This removes the measurement and all associated series, including the records you want to delete.

  3. Copy the contents of the temporary table into a table with the same name as your original table:

    > SELECT * INTO temperature FROM cleaned GROUP BY *

    Note:

    • Once again, the GROUP BY * clause is critical. If you omit it, you will lose your tags and all your series will be coalesced into a single series.
  4. Remove the temporary table:

    > DROP MEASUREMENT cleaned
  5. Confirm that the measurements you wanted to remove have disappeared:

    > SELECT * FROM temperature WHERE temp > 43.0
    > 

    Silence is an empty result set.

use-case: delete a range of observations

Let's suppose all the data supplied by the weather sensor on Jan 1st 2024 was suspect. Providing you can safely set bounds by time then you can use a range.

First, how many observations were there in total for all of that date?

> SELECT count(temp) FROM temperature WHERE time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time                      count
----                      -----
2024-01-01T00:00:00+11:00 1475

And how many of those were supplied by the weatherstation sensor?

SELECT count(temp) FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time                      count
----                      -----
2024-01-01T00:00:00+11:00 240

The reason for doing both queries is to remind you that it is important to be precise. If we just deleted by time range, we would wind up clobbering all 1,475 records from all sensors rather than just the 240 records from the problem sensor.

I have specified the time ranges in my local time (UTC+11). The tz() function tells the InfluxCLI to display the results in my local time. I do it this way so I can be sure that the time-range I'm targeting is what I think of as January 1st. The usefulness of the tz() function becomes more apparent when you list actual records rather than counts.

The restriction on using a WHERE clause to select records for deletion only applies to fields, not tags. All by itself, that's a very good reason for employing tags for your invariant metadata, reserving fields for your data.

Now that we can be precise, we can delete by time-range:

DELETE FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00'

Prove that the records have gone:

> SELECT count(temp) FROM temperature WHERE time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
name: temperature
time                      count
----                      -----
2024-01-01T00:00:00+11:00 1235

> SELECT count(temp) FROM temperature WHERE device = 'weatherstation' AND time >= '2024-01-01T00:00:00.000+11:00' AND time < '2024-01-02T00:00:00.000+11:00' tz('Australia/Sydney')
> 

Again, silence on the second query indicates an empty result set. The difference between the counts of 1,475 and 1,235 is 240. which is the number of records we expected to delete.

use-case: delete by tag

Commissioning a new sensor sometimes trigger's Murphy's Law. Let's assume all the temperature data getting into your database from the weatherstation sensor is garbage and you want to remove it and start over.

Providing you associated each of your sensors with its own unique tag, it's easy to delete the entire series using that tag.

The series before we begin:

> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
temperature,device=weatherstation     ««« to be deleted
wind,device=weatherstation

Let's assume the weatherstation has been providing temperature values in Fahrenheit rather than Celsius (or vice versa). You have fixed the sensor but now you have data with mixed units. Although you could try to convert the values on-the-fly, it is probably easier to start over:

> DROP SERIES FROM temperature where device = 'weatherstation'

Prove the series has been removed:

> SHOW SERIES
key
---
humidity,device=laundry
humidity,device=lounge
humidity,device=porch
humidity,device=bedroom
pressure,device=bedroom
rainfall,device=weatherstation
temperature,device=laundry
temperature,device=lounge
temperature,device=bedroom
temperature,device=porch
temperature,device=slab
wind,device=weatherstation

The temperature,device=weatherstation series has disappeared. The other series from the weatherstation (rainfall and wind speed) have been left intact, as have the temperature series from other sensors.

As soon as the weatherstation sensor posts a new temperature observation, the temperature,device=weatherstation series will be recreated automatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment