Skip to content

Instantly share code, notes, and snippets.

@tateisu
Last active February 24, 2019 05:40
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tateisu/c3c13f2580961e86f79889a17f57260a to your computer and use it in GitHub Desktop.
Save tateisu/c3c13f2580961e86f79889a17f57260a to your computer and use it in GitHub Desktop.
Removing local copies of unreferenced external posts

Removing local copies of unreferenced external posts

日本語 => https://gist.github.com/tateisu/3d98290f2b72d12ba5f1b977a0d5743c

Summary

SQL exapmle that deletes local copies of external posts flowing from the FTL or relays and that are not referenced long time by local users.

Note: It is "reduction" of data increase , not "to be able to operate infinitely with a fixed disk capacity".

Background

while running Mastodon instance, toots data gradually accumulates. I understand that my instance should keep records of local user's own posts and external posts that local users are interested in. However, there is no reason to long keep records of the external posts such as "posts by external users who are no longer followed" and "posts by external users unconditionally flowing through relays". Can we delete the copy of unreferenced external posts for reducing the increase in recorded data? Especially with relay, Can we avoid to judge like "Do not use relays because it increasing posts data too much"?

Precondition

  • PostgreSQL 9.5 or later.
  • Prepare in advance tootctl media remove.

Create Index

Since this SQL is very slow, please make indexes in advance.

"local user" partial index

create index index_accounts_local on accounts(id) where domain is null;

You can erase it after work.

"status id of pinned post" index

create index index_status_pins_status_id on status_pins(status_id);

You can erase it after work.

"remote_url of media attachment" index

create index index_media_attachments_remote_url on media_attachments(remote_url) where remote_url is not null;

After deleting statuses, PgHero shows slow query like as SELECT "media_attachments".* FROM "media_attachments" WHERE "media_attachments"."status_id" IS NULL AND "media_attachments"."remote_url" = ? ORDER BY "media_attachments"."id" ASC LIMIT ?. I made index to avoid this. I do not know what conditions this query will occur due to lack of survey, I can not be said clearly when we can delete this index.

SQL that deletes old unreferenced external posts.

DELETE FROM statuses WHERE id in (
SELECT id FROM statuses
where not local 
and id < (EXTRACT(EPOCH FROM ((TIMESTAMP 'now')-interval '100 days')) * 65536000)::bigint
and reblog_of_id is null
and in_reply_to_id is null
and id not in ( select status_pins.status_id from status_pins where statuses.id = status_id)
and id not in ( select mentions.status_id from mentions where statuses.id = status_id)
and id not in ( select sb.in_reply_to_id from statuses as sb where statuses.id = sb.in_reply_to_id)
and id not in ( select favourites.status_id from favourites where statuses.id = status_id
  and favourites.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
and id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
  and sb.account_id in ( select accounts.id from accounts where accounts.domain is null) 
)
and account_id not in ( select follows.target_account_id from follows where statuses.account_id = follows.target_account_id
  and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
order by id asc limit 2000
);

Since the DELETE command has an exclusive lock of the entire table, it causes other processes to stagnate. This query may be quite slow (depending on server specs), we should limit the number of deletions at once.

Also, when used in scripts etc., you can get "list of deleted status IDs" by appending returning id at the end of delete statement.

Explanation of each condition

Check the query conditions separately in the following sections.

Note: About the using of not in instead of inner join ~ where null or not exists. It has impact on the execution plan of the query on PostgreSQL 9.6.x . When using not exists, Merge Anti join ~ Materialize is often used as an execution plan, and it runs subqueries even outside the specified range of status ID. then not exists is rewritten with not in, SubPlan will be processed to Cond Check inside the range scan of the statuses table instead of join. If you look at cost alone, it becomes heavy, but the rows handled in subquery is drastically reduced.

Not local post

Corresponds to not local .

Posts and boosts by local users should not be deleted.

Well old

Corresponds to id < (EXTRACT(EPOCH FROM ((TIMESTAMP 'now')-interval '100 days')) * 65536000)::bigint .

We should delete posts only older than the number of days specified in tootctl media remove_remote (default 7 days).

Note: Mastodon's status ID is Snowflake ID, multiplying unix time by 65536000 will replace date range specification to ID range specification.

Note: now() is not a constant expression, but (TIMESTAMP 'now') is constant expression.

Not pinned

Corresponds to id not in ( select status_pins.status_id from status_pins where statuses.id = status_id) .

Pinned statuses are a kind of "current state" displayed in the user profile, so should not be deleted even if it is old.

Not boost

Corresponds to reblog_of_id is null .

Deleting a post that expresses a boost affects the display of "people boosting a post". Since posts expressing Boost do not flowing through Relay or FTL, there is no aggressive reason to delete.

Not a reply, Not replied, Not contain mentions

Corresponds to :

in_reply_to_id is null
and id not in ( select mentions.status_id from mentions where statuses.id = status_id)
and id not in ( select sb.in_reply_to_id from statuses as sb where statuses.id = sb.in_reply_to_id)

Local user may be involved only in a small part of the conversation tree, in which case it is desirable for local user to be able to view the entire conversation tree. Therefore, it should not be deleted.

Note: I do not mind deleting a conversation tree where the local user is not involved at all, but this time the script does not make such a complicated judgment, all conversation tree or mentions is not deleted.

Not followed by local users

Corresponds to :

account_id not in ( select follows.target_account_id from follows where statuses.account_id = follows.target_account_id
  and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)

Not favorited by local users

Corresponds to :

id not in ( select favourites.status_id from favourites where statuses.id = status_id
  and favourites.account_id in ( select accounts.id from accounts where accounts.domain is null)
)

Not boosted by local users

Corresponds to :

id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
  and sb.account_id in ( select accounts.id from accounts where accounts.domain is null) 
)

Not boosted by remote users who followed by local users

This is not included in the query example above.

If this condition is not included, so when boosted posts are deleted Posts indicating that boosting is also deleted because db/schema.rb hasadd_foreign_key 'statuses', 'statuses', column: 'reblog_of_id', on_delete:: cascade. As a result, it seems that the boost disappears in the account TL of the corresponding remote user.

Some people may oppose the "boost by followers" can not be seen. If you want to add this condition, can you write as? :

id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
  and ( sb.account_id in ( select accounts.id from accounts where accounts.domain is null ) 
     or sb.account_id in ( select follows.target_account_id from follows where sb.account_id = follows.target_account_id
       and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
     )
)

Impact after post deleted

  • this script does not delete posts including conversations and mentions, then the equivalent to Statuses.unlink_from_conversations should not be necessary.
  • All other foreign keys that references status id should have on_delete:: cascade or on_delete:: nullify definition.
  • The media attachments will lost relation to status. it will be deleted by MediaCleanupScheduler that scheduled on sidekiq.
@tateisu
Copy link
Author

tateisu commented Dec 17, 2018

script example that repeatly delete posts.

  • using delete ~ returning id to get id of previous execution
  • use it in id range spec of next execution.
#!/usr/bin/perl --
use strict;
use warnings;


sub formatDuration($){
    my($t)=@_; 
    my $h = int($t/3600); $t = $t % 3600;
    my $m = int($t/60); $t = $t % 60;
    my $s = $t;
    return ($h>0?"${h}h":"").($m>0?"${m}m":"")."${s}s";
}

sub runQuery($){
    my( $query )= @_;
    $query =~ s/\s+/ /g;

    exit if not $query;

    print( "".localtime," $query\n");

    my $t = time;
    my $r = `docker exec mastodon1_db_backend_1 psql -U postgres postgres -A -q -t -c "$query" &2>&1`;
    $t = time -$t;
    print( "duration=",formatDuration($t),"\n");

    return $r;
}

# コマンドライン引数に 前回処理し終えたIDを渡すと、その続きから処理する
my $minId = 0+(shift // 0); 

for(1..5000){

    my $r = runQuery <<"END";
DELETE FROM statuses WHERE id in (
SELECT id FROM statuses
where not local 
and id < (EXTRACT(EPOCH FROM ((TIMESTAMP 'now')-interval '100 days')) * 65536000)::bigint
and id > $minId
and reblog_of_id is null
and in_reply_to_id is null
and id not in ( select status_pins.status_id from status_pins where statuses.id = status_id)
and id not in ( select mentions.status_id from mentions where statuses.id = status_id)
and id not in ( select sb.in_reply_to_id from statuses as sb where statuses.id = sb.in_reply_to_id)
and id not in ( select favourites.status_id from favourites where statuses.id = status_id
  and favourites.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
and id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
  and sb.account_id in ( select accounts.id from accounts where accounts.domain is null) 
)
and account_id not in ( select follows.target_account_id from follows where statuses.account_id = follows.target_account_id
  and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
order by id asc limit 2000
) returning id;
END

    my $count =0;
    for my $col (split /\s+/,$r){
        if( not $col =~ /\A\d+\z/ ){
            print $r;
            exit;
        }
        my $id = 0+$col;
        ++$count;
        $minId = $id if $id > $minId;
    }

    if(not $count){
        print $r;
        exit;
    }

    my $when ="";
    my $t = ($minId>>16)/1000;
    if( $t >= 1490972400 ){
        my @lt = localtime($t);
        $lt[5]+=1900;$lt[4]+=1;
        $when = sprintf("%d-%02d-%02d_%02d:%02d:%02d",reverse @lt[0..5]);
    }

    print "count=$count,lastId=$minId,when=$when\n";
}

@tateisu
Copy link
Author

tateisu commented Dec 17, 2018

I tried running it at mastodon.juggler.jp.

  • status count(before): 11702014
  • status count(after): 10480120

It is not an exact figure because there are increased posts during the execution, but about 11% posts could be deleted.

This time it should have deleted "posts by people who is no longer followed by local users" in the time,
but if I try at half a year later, I should be able to delete also "Toot that came from relay but not touched at all by local users".

@Gargron
Copy link

Gargron commented Feb 16, 2019

Not favorited by local users

I believe that subquery can be simplified by removing the local accounts check, because Mastodon does not store favourites by remote accounts on remote statuses. By selecting remote statuses, we already guarantee all favourites will be local.

Not contain mentions

I believe here the local accounts check on the subquery should be added, because we do not care about remote statuses mentioning only remote accounts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment