Postgres error: Missing chunk 0 for toast value in pg_toast

The problem

In some cases, PostgreSQL tables can get corrupted. This can happen after hardware failures (e.g. hard disk drives with write-back cache enabled, RAID controllers with a faulty or worn-out battery backup, etc.), as clearly reported in this wiki page. Furthermore, it can happen in case of an incorrect setup as well.

One of the symptoms of such corruption is the following message:

ERROR: missing chunk number 0 for toast value 123456 in pg_toast_45678

This almost surely indicates that a corrupted chunk is present within a table file. Fortunately, there is a good way to get rid of it.

The solution

Let's suppose that the corrupted table is called mytable. Many articles on the Internet suggest firing the following query against the database:

psql> select reltoastrelid::regclass from pg_class where relname = 'mytable';

 reltoastrelid      
-------------------------
 pg_toast.pg_toast_40948
(1 row)

and then running the following commands:

REINDEX table mytable;
REINDEX table pg_toast.pg_toast_40948;
VACUUM analyze mytable;

But in my case this was not enough, so I computed the number of rows in mytable:

psql> select count(*) from mytable;

 count
-------
 58223

To locate the corruption, you can fetch data from the table until the 'Missing chunk...' error appears. The following group of queries does the job:

select * from mytable order by id limit 5000 offset 0;
select * from mytable order by id limit 5000 offset 5000;
select * from mytable order by id limit 5000 offset 10000;
select * from mytable order by id limit 5000 offset 15000;
select * from mytable order by id limit 5000 offset 20000;
...

...and so on, until the error appears. In this example, if you reach offset 55000 (55000 + 5000 = 60000, which exceeds the total number of records) without getting the error, then your table is not corrupted. The order by clause is necessary to make the query repeatable, i.e. to ensure that rows are not returned in a random order, so that the limit and offset clauses work as expected. If your table does not have an id field, find another suitable field to order by; for performance reasons, an indexed field is preferable.

To go faster and keep the console clean, the query can be run directly from the shell, redirecting the output to /dev/null and printing a message only when an error occurs:

psql -U pgsql -d mydatabase -c "select * from mytable order by id limit 5000 offset 0" > /dev/null || echo "Corrupted chunk read!"

The syntax above means: execute the query and redirect the output to /dev/null; in case of error (||), print an error message.
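
To sweep the whole table without typing each query by hand, the same one-liner can be wrapped in a small shell loop. Here is a minimal bash sketch (not part of the original gist), reusing the connection parameters, table name and row count from the examples above:

#!/bin/bash
# Scan mytable in 5000-row windows and report every window that fails.
total=58223   # total number of rows, taken from the count(*) above
step=5000
for ((offset = 0; offset < total; offset += step)); do
  psql -U pgsql -d mydatabase \
       -c "select * from mytable order by id limit $step offset $offset" \
       > /dev/null || echo "Corrupted chunk read in rows $offset..$((offset + step - 1))"
done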

Let's suppose that the first query giving the error is the following:

select * from mytable order by id limit 5000 offset 10000;
Corrupted chunk read!
>

Now, you know that the corrupted chunk is in the rows between 10000 and 14999. So, you can narrow the search by halving the query LIMIT clause.

select * from mytable order by id limit 2500 offset 10000;
Corrupted chunk read!
>

So, the error is in the rows between 10000 and 12499. We halve the limit again.

select * from mytable order by id limit 1250 offset 10000;
>

Fetching the rows between 10000 and 11249 does not return any error. So the error must be in the rows between 11250 and 12499. We can confirm this by firing the query:

select * from mytable order by id limit 1250 offset 11250;
Corrupted chunk read!
>

So we halve the limit again.

select * from mytable order by id limit 625 offset 11250;
>
select * from mytable order by id limit 625 offset 11875;
Corrupted chunk read!
>
...

Continue narrowing until you pinpoint the corrupted row exactly:

...
select * from mytable order by id limit 1 offset 11963;
Corrupted chunk read!
>

Note that in this last query the LIMIT 1 clause identifies exactly one row.
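
The halving procedure can also be scripted. Below is a minimal bash sketch of the binary search (not part of the original gist); it assumes a single corrupted row and reuses the same connection parameters and row count used above:

#!/bin/bash
# Binary search for the offset of a single corrupted row in mytable.
lo=0
hi=58223   # total number of rows

probe() {  # probe <limit> <offset>: exit status is non-zero if the window hits the corruption
  psql -U pgsql -d mydatabase \
       -c "select * from mytable order by id limit $1 offset $2" > /dev/null 2>&1
}

while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if probe $((mid - lo)) $lo; then
    lo=$mid   # lower half is clean: the corruption lies at or above mid
  else
    hi=$mid   # lower half already fails: the corruption lies below mid
  fi
done
echo "Corrupted row at offset $lo"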

Finally, find the id of the corrupted row and delete it (this obviously means losing that row's data):

psql> select id from mytable order by id limit 1 offset 11963;
   id
--------
 121212

psql> delete from mytable where id = 121212;
DELETE 1
>

While searching for the corrupted row, keep in mind that the corruption most likely affects the most recently inserted or updated records, although this is not a general rule. Choosing a sort key that follows the physical insert/update order can therefore reduce the scan time.
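
For example, if mytable has a serial id (or an indexed timestamp) that grows with insert order, scanning in descending order visits the most recent rows first (an illustration, not from the original gist):

select * from mytable order by id desc limit 5000 offset 0;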

If you prefer to fully automate the search for corrupted rows, consider using the following script (in csh syntax):

#!/bin/csh
# 58223 is the total number of table rows;
# the ORDER BY keeps the row-to-offset mapping stable across queries
set j = 0
while ($j < 58223)
  psql -U pgsql -d mydatabase -c "SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET $j" >/dev/null || echo $j
  @ j++
end

This script prints the offset of every corrupted row. For large tables it can take a long time, since it runs as many queries as there are rows in the table.

xyhtac wrote a tool that implements this algorithm with a binary search enhancement (way faster). You can find it here.

Credits

Credits to this post and to this tweet.

@supix (Author) commented Jan 13, 2022

@Bader72 Correct: when you have two indexes, choosing either one is fine.

Try executing the following script again.

#!/bin/csh
set j = 0
while ($j < _hereTheTableLength_)
  psql -U pgsql -d mydatabase -c "SELECT * FROM mytable order by ideliveryid LIMIT 1 OFFSET $j" >/dev/null || echo $j
  @ j++
end

Then, based on the output, you can delete the remaining corrupted rows. For example, if you get this output:

ERROR: missing chunk number 0 for toast value 21221812 in pg_toast_16624
12345

you can execute

SELECT ideliveryid FROM mytable order by ideliveryid LIMIT 1 offset 12345

Say the returned value is 54321; you then execute:

DELETE FROM mytable WHERE ideliveryid = 54321

@mitchtchesnitch

Thank you very much, this approach saved us a lot of time!

@xyhtac commented May 15, 2022

Thank you so much for this insight.

I had to deal with large tables (1.2 terabytes) and thousands of corrupted fields, so I made a tool for quick binary search and recovery.

@supix (Author) commented May 17, 2022

@xyhtac, thanks a lot. Linked!

@alptureci

Hi,
Thank you for the detailed solution. I am trying to follow it; however, when I executed the line below
REINDEX table pg_toast.pg_toast_pg_toast_2985244; I got this error:
ERROR: permission denied for schema pg_toast

I am not sure what could cause this. The user has administrative access on the database I am trying to fix, and I granted all the table and function rights. What am I missing? Can you please help?

@mizhka commented Jul 11, 2022

If anyone faces this issue, please check visibility maps as well:

SELECT c.relname, v.* 
  FROM pg_class c, 
       lateral pg_visibility(c.oid::regclass) v 
 WHERE c.relkind in ('r','t') 
   and v.all_frozen 
   and not v.pd_all_visible;
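
(A note not in the original comment: pg_visibility() is provided by the pg_visibility contrib extension, so the query above assumes the extension has been installed first.)

CREATE EXTENSION IF NOT EXISTS pg_visibility;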

Feel free to contact me for help with this kind of issue (both to fix it and to find the root cause).

@DauletT commented Nov 8, 2022

Thank you very much!

@JuniorCaldeira

Thank you very much, you helped me a lot

@lingyanmeng

Thanks!

@azhinu commented Jun 8, 2023

Really helpful.

@benzimmer

Hey, thanks for this!
In the intro you write

Furthermore, it can happen in case of incorrect setup, as well.

Do you have examples for "incorrect setup"s?

@supix (Author) commented Sep 4, 2023

@benzimmer,

Do you have examples for "incorrect setup"s?

For example, giving the database too few resources (e.g. memory or disk space) to run.

@aamir-mansoori

Since we can find the culprit record by running the shell script above, we can then run the block below to find the culprit column (passing the record's id) and afterwards update that column to NULL.

DO
$$
DECLARE
    col_list record;
    column_name varchar;  -- the dynamic column name
    column_value varchar; -- variable to store the retrieved column data
BEGIN
    FOR col_list IN
        SELECT t.table_name, t.column_name
          FROM information_schema."columns" t
         WHERE t.table_schema = 'your-schema-name'
           AND t.table_name = 'your-table-name'
         ORDER BY ordinal_position
    LOOP
        BEGIN
            -- Set the dynamic column name
            column_name := col_list.column_name;

            -- Construct and execute a dynamic SQL query
            EXECUTE 'SELECT ' || column_name || ' FROM your-table-name WHERE id = 1234'
                INTO column_value;

            -- Print the retrieved column value
            RAISE NOTICE 'Column % is ok for data %', column_name, column_value;
        EXCEPTION
            WHEN others THEN
                -- Handle exceptions here
                column_value := '-1';
                RAISE EXCEPTION 'Corrupted data occurred for column %: %', column_name, SQLERRM;
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
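
Once the block above reports the corrupted column, the update it describes would look like this (a hypothetical example using the same placeholders and id; your-corrupted-column stands for the column the block reported):

UPDATE your-table-name SET your-corrupted-column = NULL WHERE id = 1234;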

@praisin commented Mar 1, 2024

Thanks for sharing.

@ckazi commented Apr 22, 2024

I wrote a utility in Golang that solves this problem out of the box. Here's the link: https://github.com/ckazi/chunky
