Postgres error: Missing chunk 0 for toast value in pg_toast

The problem

In some cases, PostgreSQL tables can get corrupted. This can happen after hardware failures (e.g. hard disk drives with write-back cache enabled, RAID controllers with a faulty or worn-out battery backup, etc.), as clearly reported in this wiki page. Furthermore, it can happen in case of an incorrect setup as well.

One of the symptoms of such corruption is the following message:

ERROR: missing chunk number 0 for toast value 123456 in pg_toast_45678

This almost surely indicates that a corrupted chunk is present within a table file. Fortunately, there is a good way to get rid of it.

The solution

Let's suppose that the corrupted table is called mytable. Many articles on the Internet suggest firing the following query against the database:

psql> select reltoastrelid::regclass from pg_class where relname = 'mytable';

 reltoastrelid      
-------------------------
 pg_toast.pg_toast_40948
(1 row)

and then running the following commands:

REINDEX table mytable;
REINDEX table pg_toast.pg_toast_40948;
VACUUM analyze mytable;

But in my case this was not enough, so I computed the number of rows in mytable:

psql> select count(*) from mytable;

 count
-------
 58223

To locate the corruption, you can fetch data from the table until the 'Missing chunk...' error appears. The following group of queries does the job:

select * from mytable order by id limit 5000 offset 0;
select * from mytable order by id limit 5000 offset 5000;
select * from mytable order by id limit 5000 offset 10000;
select * from mytable order by id limit 5000 offset 15000;
select * from mytable order by id limit 5000 offset 20000;
...

...and so on, until the error appears. In this example, if you reach offset 55000 (55000 + 5000 = 60000, which exceeds the total number of records) without getting the error, then your table is not corrupted. The order by clause is necessary to make the query repeatable, i.e. to ensure that rows are not returned in a random order, so that the limit and offset clauses work as expected. If your table does not have an id field, find another suitable field to order by; for performance reasons, an indexed field is preferable.

To go faster and keep the console clean, the query can be run directly from the shell, redirecting the output to /dev/null and printing a message only when an error occurs:

psql -U pgsql -d mydatabase -c "select * from mytable order by id limit 5000 offset 0" > /dev/null || echo "Corrupted chunk read!"

The syntax above means: execute the query and redirect the output to /dev/null; in case of error (||), print an error message.
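
To sweep the whole table without typing each query by hand, the same one-liner can be wrapped in a small shell loop. Here is a minimal bash sketch (not part of the original gist), reusing the connection parameters, table name and row count from the examples above:

#!/bin/bash
# Scan mytable in 5000-row windows and report every window that fails.
total=58223   # total number of rows, taken from the count(*) above
step=5000
for ((offset = 0; offset < total; offset += step)); do
  psql -U pgsql -d mydatabase \
       -c "select * from mytable order by id limit $step offset $offset" \
       > /dev/null || echo "Corrupted chunk read in rows $offset..$((offset + step - 1))"
done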

Let's suppose that the first query giving the error is the following:

select * from mytable order by id limit 5000 offset 10000;
Corrupted chunk read!
>

Now, you know that the corrupted chunk is in the rows between 10000 and 14999. So, you can narrow the search by halving the query LIMIT clause.

select * from mytable order by id limit 2500 offset 10000;
Corrupted chunk read!
>

So, the error is in the rows between 10000 and 12499. We halve the limit again.

select * from mytable order by id limit 1250 offset 10000;
>

Fetching the rows between 10000 and 11249 does not return any error. So the error must be in the rows between 11250 and 12499. We can confirm this by firing the query:

select * from mytable order by id limit 1250 offset 11250;
Corrupted chunk read!
>

So we halve the limit again.

select * from mytable order by id limit 625 offset 11250;
>
select * from mytable order by id limit 625 offset 11875;
Corrupted chunk read!
>
...

Continue narrowing until you pinpoint the corrupted row exactly:

...
select * from mytable order by id limit 1 offset 11963;
Corrupted chunk read!
>

Note that in this last query the LIMIT 1 clause identifies exactly one row.
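
The halving procedure can also be scripted. Below is a minimal bash sketch of the binary search (not part of the original gist); it assumes a single corrupted row and reuses the same connection parameters and row count used above:

#!/bin/bash
# Binary search for the offset of a single corrupted row in mytable.
lo=0
hi=58223   # total number of rows

probe() {  # probe <limit> <offset>: exit status is non-zero if the window hits the corruption
  psql -U pgsql -d mydatabase \
       -c "select * from mytable order by id limit $1 offset $2" > /dev/null 2>&1
}

while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if probe $((mid - lo)) $lo; then
    lo=$mid   # lower half is clean: the corruption lies at or above mid
  else
    hi=$mid   # lower half already fails: the corruption lies below mid
  fi
done
echo "Corrupted row at offset $lo"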

Finally, find the id of the corrupted row and delete it (this obviously means losing that row's data):

psql> select id from mytable order by id limit 1 offset 11963;
   id
--------
 121212

psql> delete from mytable where id = 121212;
DELETE 1
>

While searching for the corrupted row, keep in mind that the corruption most likely affects the most recently inserted or updated records, although this is not a general rule. Choosing a sort key that follows the physical insert/update order can therefore reduce the scan time.
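
For example, if mytable has a serial id (or an indexed timestamp) that grows with insert order, scanning in descending order visits the most recent rows first (an illustration, not from the original gist):

select * from mytable order by id desc limit 5000 offset 0;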

If you prefer to fully automate the search for corrupted rows, consider using the following script (in csh syntax):

#!/bin/csh
# 58223 is the total number of table rows;
# the ORDER BY keeps the row-to-offset mapping stable across queries
set j = 0
while ($j < 58223)
  psql -U pgsql -d mydatabase -c "SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET $j" >/dev/null || echo $j
  @ j++
end

This script prints the offset of every corrupted row. For large tables it can take a long time, since it runs as many queries as there are rows in the table.

xyhtac wrote a tool that implements this algorithm with a binary search enhancement (way faster). You can find it here.

Credits

Credits to this post and to this tweet.

@supix (Author) commented Jan 13, 2022

@Bader72 Correct: when you have two indexes, choosing either one is fine.

Try executing the following script again.

#!/bin/csh
set j = 0
while ($j < _hereTheTableLength_)
  psql -U pgsql -d mydatabase -c "SELECT * FROM mytable order by ideliveryid LIMIT 1 OFFSET $j" >/dev/null || echo $j
  @ j++
end

Then, based on the output, you can delete the remaining corrupted rows. For example, if you get this output:

ERROR: missing chunk number 0 for toast value 21221812 in pg_toast_16624
12345

you can execute

SELECT ideliveryid FROM mytable order by ideliveryid LIMIT 1 offset 12345

Say the returned value is 54321; you then execute:

DELETE FROM mytable WHERE ideliveryid = 54321

@mitchtchesnitch

Thank you very much, this approach saved us a lot of time!

@xyhtac commented May 15, 2022

Thank you so much for this insight.

I had to deal with large tables (1.2 terabytes) and thousands of corrupted fields, so I made a tool for quick binary search and recovery.

@supix (Author) commented May 17, 2022

@xyhtac, thanks a lot. Linked!

@alptureci

Hi,
Thank you for the detailed solution. I am trying to follow it; however, when I executed the line below
REINDEX table pg_toast.pg_toast_pg_toast_2985244; I got this error:
ERROR: permission denied for schema pg_toast

I am not sure what could cause this. The user has administrative access on the database I am trying to fix, and I granted all the table and function rights. What am I missing? Can you please help?

@mizhka commented Jul 11, 2022

If anyone faces this issue, please check visibility maps as well:

SELECT c.relname, v.* 
  FROM pg_class c, 
       lateral pg_visibility(c.oid::regclass) v 
 WHERE c.relkind in ('r','t') 
   and v.all_frozen 
   and not v.pd_all_visible;
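
(A note not in the original comment: pg_visibility() is provided by the pg_visibility contrib extension, so the query above assumes the extension has been installed first.)

CREATE EXTENSION IF NOT EXISTS pg_visibility;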

Feel free to contact me for help with this kind of issue (both to fix it and to find the root cause).

@DauletT commented Nov 8, 2022

Thank you very much!

@JuniorCaldeira

Thank you very much, you helped me a lot

@lingyanmeng

Thanks!

@azhinu commented Jun 8, 2023

Really helpful.

@benzimmer

Hey, thanks for this!
In the intro you write

Furthermore, it can happen in case of incorrect setup, as well.

Do you have examples for "incorrect setup"s?

@supix (Author) commented Sep 4, 2023

@benzimmer,

Do you have examples for "incorrect setup"s?

For example, giving the database too few resources (e.g. memory or disk space) to run.

@aamir-mansoori

Since we can find the culprit record by running the shell script above, we can then run the block below to find the culprit column (passing the record's id) and afterwards update that column to NULL.

DO
$$
DECLARE
    col_list record;
    column_name varchar;  -- the dynamic column name
    column_value varchar; -- variable to store the retrieved column data
BEGIN
    FOR col_list IN
        SELECT t.table_name, t.column_name
          FROM information_schema."columns" t
         WHERE t.table_schema = 'your-schema-name'
           AND t.table_name = 'your-table-name'
         ORDER BY ordinal_position
    LOOP
        BEGIN
            -- Set the dynamic column name
            column_name := col_list.column_name;

            -- Construct and execute a dynamic SQL query
            EXECUTE 'SELECT ' || column_name || ' FROM your-table-name WHERE id = 1234'
                INTO column_value;

            -- Print the retrieved column value
            RAISE NOTICE 'Column % is ok for data %', column_name, column_value;
        EXCEPTION
            WHEN others THEN
                -- Handle exceptions here
                column_value := '-1';
                RAISE EXCEPTION 'Corrupted data occurred for column %: %', column_name, SQLERRM;
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
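
Once the block above reports the corrupted column, the update it describes would look like this (a hypothetical example using the same placeholders and id; your-corrupted-column stands for the column the block reported):

UPDATE your-table-name SET your-corrupted-column = NULL WHERE id = 1234;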

@praisin commented Mar 1, 2024

Thanks for sharing.

@ckazi commented Apr 22, 2024

I wrote a utility in Golang that solves this problem out of the box. Here's the link: https://github.com/ckazi/chunky
