Skip to content

Instantly share code, notes, and snippets.

@streamer45
Last active February 9, 2022 11:17
Show Gist options
  • Save streamer45/59b3582118913d4fc5e8ff81ea78b055 to your computer and use it in GitHub Desktop.
Save streamer45/59b3582118913d4fc5e8ff81ea78b055 to your computer and use it in GitHub Desktop.
Mattermost v6.0 DB Schema Migrations Analysis

Mattermost v6.0 DB Schema Migrations Analysis

Intro

Major version 6.0 introduces several database schema changes that aim at improving both database and application performance. As a natural consequence of that, the impact of the migration process has been affected.

We've conducted extensive tests on both supported db drivers (MySQL/PostgreSQL) using realistic datasets (10M+ posts). Our results, along with possible mitigation strategies, are outlined in this document.

DB Specs

MySQL

  • 1 db.r5.2xlarge instance running Aurora MySQL (5.7.mysql_aurora.2.10.0)

PostgreSQL

  • 1 db.r5.2xlarge instance running Aurora PostgreSQL (11.7)

Data Size

+------------------------+----------+
| Table                  | Rows     |
+------------------------+----------+
| Posts                  | 11470022 |
| ChannelMembers         |  4503095 |
| Threads                |   831332 |
| Channels               |   264168 |
| Users                  |     9698 |
| LinkMetadata           |     1600 |
| Status                 |     1002 |
| Sessions               |     1002 |
| Jobs                   |      242 |
+------------------------+----------+

Queries

Note

Queries are in order of execution during the in-app migration process with the exception of MySQL queries #4, #11, #26 that are combined in a single statement to minimize running time. On top are the queries executed first. Longest durations are highlighted.

MySQL

# Query Duration Impact
1 ALTER TABLE ChannelMembers MODIFY NotifyProps JSON; ~2m Moderate spike in DB CPU usage. metadata lock on ChannelMembers table. No write operations are possible for the whole duration of the query.
2 ALTER TABLE LinkMetadata MODIFY Data JSON; ~50ms No impact noticed. metadata lock on LinkMetadata table. No write operations are possible for the whole duration of the query.
3 ALTER TABLE Jobs MODIFY Data JSON; ~90ms No impact noticed. metadata lock on Jobs table. No write operations are possible for the whole duration of the query.
4 ALTER TABLE Posts MODIFY Props JSON; ~26m Significant spike in DB CPU usage. metadata lock on Posts table. No write operations are possible for the whole duration of the query.
5 ALTER TABLE Sessions MODIFY Props JSON; ~100ms No impact noticed. metadata lock on Sessions table. No write operations are possible for the whole duration of the query.
6 ALTER TABLE Threads MODIFY Participants JSON; ~20s Moderate spike in DB CPU usage. metadata lock on Threads table. No write operations are possible for the whole duration of the query.
7 ALTER TABLE Users MODIFY Props JSON; ~2s No impact noticed. metadata lock on Users table. No write operations are possible for the whole duration of the query.
8 ALTER TABLE Users MODIFY NotifyProps JSON; ~2s No impact noticed. metadata lock on Users table. No write operations are possible for the whole duration of the query.
9 ALTER TABLE Users MODIFY Timezone JSON; ~2s No impact noticed. metadata lock on Users table. No write operations are possible for the whole duration of the query.
10 UPDATE Posts SET RootId = ParentId WHERE RootId = '' AND RootId != ParentId; ~20s Minimal impact noticed. A next-key lock on Posts table is created which generally doesn't cause issues writing nor reading.
11 ALTER TABLE Posts DROP COLUMN ParentId; ~26m Significant spike in DB CPU usage. metadata lock on Posts table. No write operations are possible for the whole duration of the query.
12 UPDATE CommandWebhooks SET RootId = ParentId WHERE RootId = '' AND RootId != ParentId; ~1ms Minimal impact noticed. A next-key lock on CommandsWebhooks table is created which generally doesn't cause issues writing nor reading.
13 ALTER TABLE CommandWebhooks DROP COLUMN ParentId; ~1ms No impact noticed.
14 CREATE INDEX idx_posts_root_id_delete_at ON Posts (RootId, DeleteAt); ~40s Minimal impact noticed. In place operation. Doesn't cause any significant locking. Both read and write operations to Posts table are possible during query execution.
15 DROP INDEX idx_posts_root_id ON Posts; ~62ms No impact noticed.
16 CREATE INDEX idx_channels_team_id_display_name ON Channels (TeamId, DisplayName); ~800ms No impact noticed.
17 CREATE INDEX idx_channels_team_id_type ON Channels (TeamId, Type); ~700ms No impact noticed.
18 DROP INDEX idx_channels_team_id ON Channels; ~30ms No impact noticed.
19 CREATE INDEX idx_threads_channel_id_last_reply_at ON Threads (ChannelId, LastReplyAt); ~3s Minimal impact noticed. In place operation. Doesn't cause any significant locking. Both read and write operations to Threads table are possible during query execution.
20 DROP INDEX idx_threads_channel_id ON Threads; ~30ms No impact noticed.
21 CREATE INDEX idx_channelmembers_user_id_channel_id_last_viewed_at ON ChannelMembers (UserId, ChannelId, LastViewedAt); ~20s Minimal impact noticed. In place operation. Doesn't cause any significant locking. Both read and write operations to ChannelMembers table are possible during query execution.
22 CREATE INDEX idx_channelmembers_channel_id_scheme_guest_user_id ON ChannelMembers (ChannelId, SchemeGuest, UserId); ~20s Minimal impact noticed. In place operation. Doesn't cause any significant locking. Both read and write operations to ChannelMembers table are possible during query execution.
23 DROP INDEX idx_channelmembers_user_id ON ChannelMembers; ~1ms No impact noticed.
24 CREATE INDEX idx_status_status_dndendtime ON Status (Status, DNDEndTime); ~40ms No impact noticed.
25 DROP INDEX idx_status_status ON Status; ~30ms No impact noticed.
26 ALTER TABLE Posts MODIFY COLUMN FileIds text; ~26m Significant spike in DB CPU usage. metadata lock on Posts table. No write operations are possible for the whole duration of the query.

PostgreSQL

# Query Duration Impact
1 ALTER TABLE channelmembers ALTER COLUMN notifyprops TYPE jsonb USING notifyprops::jsonb; ~40s Significant spike in DB CPU usage. ACCESS EXCLUSIVE lock on ChannelMembers table. No read nor write operations are possible for the whole duration of the query.
2 ALTER TABLE jobs ALTER COLUMN data TYPE jsonb USING data::jsonb; ~20ms ACCESS EXCLUSIVE lock on Jobs table. No read nor write operations are possible for the whole duration of the query.
3 ALTER TABLE linkmetadata ALTER COLUMN data TYPE jsonb USING data::jsonb; ~40ms ACCESS EXCLUSIVE lock on LinkMetadata table. No read nor write operations are possible for the whole duration of the query.
4 ALTER TABLE posts ALTER COLUMN props TYPE jsonb USING props::jsonb; ~11m Significant spike in DB CPU usage. ACCESS EXCLUSIVE lock on Posts table. No read nor write operations are possible for the whole duration of the query.
5 ALTER TABLE sessions ALTER COLUMN props TYPE jsonb USING props::jsonb; ~100ms ACCESS EXCLUSIVE lock on Sessions table. No read nor write operations are possible for the whole duration of the query.
6 ALTER TABLE threads ALTER COLUMN participants TYPE jsonb USING participants::jsonb; ~7s ACCESS EXCLUSIVE lock on Threads table. No read nor write operations are possible for the whole duration of the query.
7 ALTER TABLE users ALTER COLUMN props TYPE jsonb USING props::jsonb; ~1s ACCESS EXCLUSIVE lock on Users table. No read nor write operations are possible for the whole duration of the query.
8 ALTER TABLE users ALTER COLUMN notifyprops TYPE jsonb USING notifyprops::jsonb; ~1s ACCESS EXCLUSIVE lock on Users table. No read nor write operations are possible for the whole duration of the query.
9 ALTER TABLE users ALTER COLUMN timezone TYPE jsonb USING timezone::jsonb; ~1s ACCESS EXCLUSIVE lock on Users table. No read nor write operations are possible for the whole duration of the query.
10 UPDATE Posts SET RootId = ParentId WHERE RootId = '' AND RootId != ParentId; ~3s Minimal impact noticed. Locking on Posts table is of type ROW EXCLUSIVE and generally doesn't cause issues writing nor reading.
11 ALTER TABLE Posts DROP COLUMN ParentId; ~3ms No impact noticed.
12 UPDATE CommandWebhooks SET RootId = ParentId WHERE RootId = '' AND RootId != ParentId; ~1ms No impact noticed.
13 ALTER TABLE CommandWebhooks DROP COLUMN ParentId; ~5ms No impact noticed.
14 CREATE INDEX idx_posts_root_id_delete_at ON Posts (RootId, DeleteAt); ~13s Locking on Posts table is of type SHARE making writes operations not possible for the whole duration of the query.
15 DROP INDEX idx_posts_root_id; ~23ms No impact noticed.
16 CREATE INDEX idx_channels_team_id_display_name ON Channels (TeamId, DisplayName); ~500ms Locking on Channels table is of type SHARE making writes operations not possible for the whole duration of the query.
17 CREATE INDEX idx_channels_team_id_type ON Channels (TeamId, Type); ~250ms Locking on Channels table is of type SHARE making writes operations not possible for the whole duration of the query.
18 DROP INDEX idx_channels_team_id; ~4ms No impact noticed.
19 CREATE INDEX idx_threads_channel_id_last_reply_at ON Threads (ChannelId, LastReplyAt); ~2s Locking on Threads table is of type SHARE making writes operations not possible for the whole duration of the query.
20 DROP INDEX idx_threads_channel_id; ~7ms No impact noticed.
21 CREATE INDEX idx_channelmembers_user_id_channel_id_last_viewed_at ON ChannelMembers (UserId, ChannelId, LastViewedAt); ~10s Locking on ChannelMembers table is of type SHARE making writes operations not possible for the whole duration of the query.
22 CREATE INDEX idx_channelmembers_channel_id_scheme_guest_user_id ON ChannelMembers (ChannelId, SchemeGuest, UserId); ~11s Locking on ChannelMembers table is of type SHARE making writes operations not possible for the whole duration of the query.
23 DROP INDEX idx_channelmembers_user_id; ~17ms No impact noticed.
24 CREATE INDEX idx_status_status_dndendtime ON Status (Status, DNDEndTime); ~14ms Locking on Status table is of type SHARE making writes operations not possible for the whole duration of the query.
25 DROP INDEX idx_status_status; ~6ms No impact noticed.
26 ALTER TABLE posts ALTER COLUMN fileids TYPE character varying(300); ~4ms No impact noticed.

Mitigation Strategies

In order to minimize the impact (and possible downtime) of the migration, a couple of strategies can be used.

Several queries can be ran prior to the upgrade (preferably off bussiness hours). Once the actual migration happens such queries will run instantly.

Queries which result won't affect the application are #1,#2,#3,#4,#5,#6,#7,#8,#14,#16,#17,#19,#21,#22,#24,#26. Such queries are considered safe to be run prior to the upgrade.

Note 1

Please make sure to have properly backed-up your database before performing any of the suggested. In case of an unpredicted failure you should be in a position to load a previous db snapshot.

Note 2

Running queries prior to the upgrade can save some downtime but it can still cause full table locking (refer to the above tables for detailed information).

Note 3 (MySQL only)

Query #4 and #26 can be combined when run prior to the upgrade as follows:

ALTER TABLE Posts MODIFY COLUMN FileIds text, MODIFY COLUMN Props JSON;

This will limit the time taken to that of a single query of that type.

Online migration (MySQL)

An online migration that avoids locking can be attempted on MySQL installations, especially for particularly heavy queries like #4, #11 and #26. This can be done through an external tool like pt-online-schema-change.

Sample execution on combined queries #4 and #26
ubuntu@ip-172-27-238-89:~$ time pt-online-schema-change --alter "MODIFY COLUMN FileIds text, MODIFY COLUMN Props JSON" D=mattermost,t=Posts -h mattermost-db-host -pmattermost-db-pass -u mmuser --execute
No slaves found.  See --recursion-method if host ip-10-1-3-144 has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `mattermost`.`Posts`...
Creating new table...
Created new table mattermost._Posts_new OK.
Altering new table...
Altered `mattermost`.`_Posts_new` OK.
2021-08-19T07:34:58 Creating triggers...
2021-08-19T07:34:58 Created triggers OK.
2021-08-19T07:34:58 Copying approximately 11500485 rows...
Copying `mattermost`.`Posts`:   2% 23:10 remain
Copying `mattermost`.`Posts`:   4% 23:18 remain
Copying `mattermost`.`Posts`:   6% 23:00 remain
Copying `mattermost`.`Posts`:   8% 22:49 remain
Copying `mattermost`.`Posts`:  10% 22:24 remain
Copying `mattermost`.`Posts`:  11% 22:02 remain
Copying `mattermost`.`Posts`:  13% 21:36 remain
Copying `mattermost`.`Posts`:  15% 21:12 remain
Copying `mattermost`.`Posts`:  17% 20:42 remain
Copying `mattermost`.`Posts`:  19% 20:12 remain
Copying `mattermost`.`Posts`:  21% 19:44 remain
Copying `mattermost`.`Posts`:  23% 19:17 remain
Copying `mattermost`.`Posts`:  25% 18:51 remain
Copying `mattermost`.`Posts`:  27% 18:23 remain
Copying `mattermost`.`Posts`:  29% 17:56 remain
Copying `mattermost`.`Posts`:  31% 17:27 remain
Copying `mattermost`.`Posts`:  33% 16:59 remain
Copying `mattermost`.`Posts`:  35% 16:29 remain
Copying `mattermost`.`Posts`:  37% 16:01 remain
Copying `mattermost`.`Posts`:  39% 15:33 remain
Copying `mattermost`.`Posts`:  41% 15:03 remain
Copying `mattermost`.`Posts`:  42% 14:35 remain
Copying `mattermost`.`Posts`:  44% 14:06 remain
Copying `mattermost`.`Posts`:  46% 13:38 remain
Copying `mattermost`.`Posts`:  48% 13:10 remain
Copying `mattermost`.`Posts`:  50% 12:42 remain
Copying `mattermost`.`Posts`:  52% 12:13 remain
Copying `mattermost`.`Posts`:  54% 11:44 remain
Copying `mattermost`.`Posts`:  56% 11:16 remain
Copying `mattermost`.`Posts`:  58% 10:46 remain
Copying `mattermost`.`Posts`:  60% 10:18 remain
Copying `mattermost`.`Posts`:  61% 09:49 remain
Copying `mattermost`.`Posts`:  63% 09:19 remain
Copying `mattermost`.`Posts`:  65% 08:51 remain
Copying `mattermost`.`Posts`:  67% 08:21 remain
Copying `mattermost`.`Posts`:  69% 07:53 remain
Copying `mattermost`.`Posts`:  71% 07:24 remain
Copying `mattermost`.`Posts`:  73% 06:55 remain
Copying `mattermost`.`Posts`:  75% 06:26 remain
Copying `mattermost`.`Posts`:  77% 05:57 remain
Copying `mattermost`.`Posts`:  78% 05:28 remain
Copying `mattermost`.`Posts`:  80% 04:58 remain
Copying `mattermost`.`Posts`:  82% 04:29 remain
Copying `mattermost`.`Posts`:  84% 04:00 remain
Copying `mattermost`.`Posts`:  86% 03:31 remain
Copying `mattermost`.`Posts`:  88% 03:01 remain
Copying `mattermost`.`Posts`:  90% 02:32 remain
Copying `mattermost`.`Posts`:  92% 02:03 remain
Copying `mattermost`.`Posts`:  93% 01:34 remain
Copying `mattermost`.`Posts`:  95% 01:05 remain
Copying `mattermost`.`Posts`:  97% 00:36 remain
Copying `mattermost`.`Posts`:  99% 00:07 remain
2021-08-19T08:01:43 Copied rows OK.
2021-08-19T08:01:43 Analyzing new table...
2021-08-19T08:01:43 Swapping tables...
2021-08-19T08:01:43 Swapped original and new tables OK.
2021-08-19T08:01:43 Dropping old table...
2021-08-19T08:01:44 Dropped old table `mattermost`.`_Posts_old` OK.
2021-08-19T08:01:44 Dropping triggers...
2021-08-19T08:01:44 Dropped triggers OK.
Successfully altered `mattermost`.`Posts`.

real	26m46.228s
user	0m1.288s
sys	0m0.185s

Note 1

Query #11 does not guarantee safety once it terminates. In fact, the server (on version 5.x) will start failing to create new posts. A restart on the new version (6.0) is hence required after that finishes.

Note 2

Similarly to the original migration query, the online migration process can cause a significant spike in CPU usage on the DB instance it runs.

Downgrading (6.0.0 --> 5.38.x/5.39.x)

Dowgrading the schema can be achieved by running the following set of queries.

Note 1

Performance impact of the downgrade process is similar to the original migration. You should refer to the sections above for more information.

Note 2

Inverse of query #26 is purposely missing from downgrade queries since its result is backwards compatible and running it would unnecessarily delay the downgrade process.

MySQL

INSERT INTO Systems (Name,Value) VALUES ('Version','5.38.0') ON DUPLICATE KEY UPDATE Value = '5.38.0';

CREATE INDEX idx_status_status ON Status (Status);
DROP INDEX idx_status_status_dndendtime ON Status;
CREATE INDEX idx_channelmembers_user_id ON ChannelMembers (UserId);
DROP INDEX idx_channelmembers_channel_id_scheme_guest_user_id ON ChannelMembers;
DROP INDEX idx_channelmembers_user_id_channel_id_last_viewed_at ON ChannelMembers;
CREATE INDEX idx_threads_channel_id ON Threads (ChannelId);
DROP INDEX idx_threads_channel_id_last_reply_at ON Threads;
CREATE INDEX idx_channels_team_id ON Channels (TeamId);
DROP INDEX idx_channels_team_id_type ON Channels;
DROP INDEX idx_channels_team_id_display_name ON Channels;
CREATE INDEX idx_posts_root_id ON Posts (RootId);
DROP INDEX idx_posts_root_id_delete_at ON Posts;

ALTER TABLE CommandWebhooks ADD COLUMN ParentId varchar(26);
UPDATE CommandWebhooks SET ParentId = '';
ALTER TABLE Posts ADD COLUMN ParentId varchar(26);
UPDATE Posts SET ParentId = '';

ALTER TABLE Users MODIFY Timezone text;
ALTER TABLE Users MODIFY NotifyProps text;
ALTER TABLE Users MODIFY Props text;
ALTER TABLE Threads MODIFY Participants longtext;
ALTER TABLE Sessions MODIFY Props text;
ALTER TABLE Posts MODIFY Props text;
ALTER TABLE Jobs MODIFY Data text;
ALTER TABLE LinkMetadata MODIFY Data text;
ALTER TABLE ChannelMembers MODIFY NotifyProps text;

PostgreSQL

INSERT INTO Systems (Name,Value) VALUES ('Version','5.38.0') ON CONFLICT (name) DO UPDATE SET Value = '5.38.0';

CREATE INDEX idx_status_status ON Status (Status);
DROP INDEX idx_status_status_dndendtime;
CREATE INDEX idx_channelmembers_user_id ON ChannelMembers (UserId);
DROP INDEX idx_channelmembers_user_id_channel_id_last_viewed_at;
DROP INDEX idx_channelmembers_channel_id_scheme_guest_user_id;
CREATE INDEX idx_threads_channel_id ON Threads (ChannelId);
DROP INDEX idx_threads_channel_id_last_reply_at;
CREATE INDEX idx_channels_team_id ON Channels (TeamId);
DROP INDEX idx_channels_team_id_type;
DROP INDEX idx_channels_team_id_display_name;
CREATE INDEX idx_posts_root_id ON Posts (RootId);
DROP INDEX idx_posts_root_id_delete_at;

ALTER TABLE CommandWebhooks ADD COLUMN ParentId varchar(26);
UPDATE CommandWebhooks SET ParentId = '';
ALTER TABLE Posts ADD COLUMN ParentId varchar(26);
UPDATE Posts SET ParentId = '';

ALTER TABLE users ALTER COLUMN timezone TYPE varchar(256);
ALTER TABLE users ALTER COLUMN notifyprops TYPE varchar(2000);
ALTER TABLE users ALTER COLUMN props TYPE varchar(4000);
ALTER TABLE threads ALTER COLUMN participants TYPE text;
ALTER TABLE sessions ALTER COLUMN props TYPE varchar(1000);
ALTER TABLE posts ALTER COLUMN props TYPE varchar(8000);
ALTER TABLE linkmetadata ALTER COLUMN data TYPE varchar(4096);
ALTER TABLE jobs ALTER COLUMN data TYPE varchar(1024);
ALTER TABLE channelmembers ALTER COLUMN notifyprops TYPE varchar(2000);
@bbodenmiller
Copy link

Do queries need to be run in order or for example could just #4 be run before upgrade?

@streamer45
Copy link
Author

@bbodenmiller

Generally speaking the order doesn't matter but there are a few exceptions, for example query #10 needs to happen before query #11. Also for indexes related queries it would be generally better to create the new one prior to deleting anything.

Coming to query #4, while it's safe to run before upgrade, given it's the most heavy query of the lot, it can run for a while on big datasets and will cause full table locking, essentially making the application read-only for the duration of the query. Again, if you are running MySQL I'd suggest to combine all queries directly affecting the Posts table to limit the overall time taken, e.g.:

ALTER TABLE Posts MODIFY COLUMN FileIds text, MODIFY COLUMN Props JSON, DROP COLUMN ParentId;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment