androidfred/database_pointers.md

## database_pointers.md

      
    Raw
  

              database_pointers.md
            
          
    Database pointers

I used to think creating and managing a db and a schema for a given service was easy, and that I knew all about how to do it... but I was wrong! There are actually quite a few things that need to be kept in mind in order to avoid easily preventable issues when working with databases and how they interact with your code and tests - this post will cover a bunch. Parts of it is pretty technology specific (eg AWS RDS and MySQL specific), but much of it is universally applicable for all db technologies so well worth reading even if you use something else.
I hope you will find it helpful - and that you will comment and point out any errors or things you think are missing that are worth mentioning!
Creating and managing databases

At Expedia, AWS RDS is commonly used. One of the main points of AWS RDS is to make creating and managing databases easy, but there are actually still a surprising amount of complexity and potential pitfalls that should be kept in mind when using RDS, much of which isn't at all made clear in AWS own (or other) documentation and tutorials.
Be careful with CloudFormation

If your RDS was created using CloudFormation, subsequently tweaking the CloudFormation down the track can result in your RDS being deleted and recreated from scratch, losing all data in the process. It goes without saying this could be disastrous, be very, very careful!
Version

Eg, in absence of a specified version of MySQL, AWS will default to MySQL 5.6, which was released in 2013 and is no longer maintained. RDS’s should not be created without specifying a version, or specifying the version as MySQL 5.6 or 5.7 – use MySQL 8.
Default settings

Different databases have different defaults (and some have better defaults than others), but make sure to sanity check and tweak the settings  according to best practices and your use case.
Eg, in MySQL, the default charset is latin1 (meaning your db will not properly store or render anything other than a weird hodgepodge of latin characters) and the default collation is latin1_swedish_ci (meaning your db will be unable to select and sort in any sensible way) which is more or less guaranteed to not be what you want.
Do not use any of the non mb4 utf8 charsets or collations. They are not actually utf8 - utf8mb4 (which is actually utf8) has replaced them.
Do not use the utf8mb4_general_ci collation. It’s incorrectly implemented and utf8mb4_unicode_ci has replaced it.
Note that the utf8mb4_bin (=binary, exact matching) collation may be more appropriate than utf8mb4_unicode_ci for certain use cases, but not for general purpose use. So using utf8mb4_unicode_ci as the default and specifying utf8mb4_bin for individual columns as required when creating tables is recommended. In depth explanation of charsets and collations are beyond the scope of this Gist, but as a brief example:


id
name


1
Zoë


2
Zoe


query
collation
result


select * from user where name = 'zoë';
utf8mb4_unicode_ci
both users


select * from user where name = 'zoe';
utf8mb4_unicode_ci
both users


select * from user where name = 'Zoë';
utf8mb4_unicode_ci
both users


select * from user where name = 'Zoe';
utf8mb4_unicode_ci
both users


select * from user where name = 'zoë';
utf8mb4_bin
no user


select * from user where name = 'zoe';
utf8mb4_bin
no user


select * from user where name = 'Zoë';
utf8mb4_bin
only Zoë


select * from user where name = 'Zoe';
utf8mb4_bin
only Zoe


Ie utf8mb4_unicode_ci is good for general purpose use (names etc) and utf8mb4_bin is good when exact matching is required. (usernames etc)
Actually, optimally (from a technical point of view, not necessarily a user experience point of view, but still) things like usernames would be constrained to case-insensitive ASCII, so you don't end up with a user zoe, another user Zoe etc etc. But if the cat is already out of the bag and there are already such similar usernames, matching exactly becomes extremely important. Imagine if you implemented a login service that didn't properly disambiguate between the username zoë and the username zoe, or between the username zoe and Zoe etc etc!
In AWS RDS, default settings for RDS instances are controlled through parameter groups - more on that will follow.
Parameter groups

Unfortunately, AWS don’t make it very clear that creating dedicated parameter groups for exclusive use by each individual db BEFORE actually creating the db is a de facto prerequisite
Parameter groups are a broad range of settings for the db, including things like eg what the default charset and collation is.
If a db is created without dedicated parameter groups, it gets assigned the default parameter groups. The default parameter groups have the bad default settings described earlier, which you don't want. And even if the default parameter groups did have better defaults, the default parameter groups still shouldn’t be used, because other devs may make changes (outside your knowledge and control) to the default parameter groups, those changes will then be reflected in your db irrespective of whether you want them to or not! This is why creating dedicated parameter groups for every db – before actually creating the db – is required.
There are two types of parameter group: "DB Parameter Group" and "DB Cluster Parameter Group"- a dedicated one of each is required for every db.
So in case you wanted to create an Aurora MySQL 5.7 db for foo-service, start by creating a dedicated foo-db "Parameter Group" and a dedicated foo-db "DB Cluster Parameter Group". (NOTE while Aurora MySQL 5.7 will be used as an example, more or less the same issues are relevant for other dbs as well, so do read this)
Which parameters to override with what depends on the db used. For Aurora MySQL 5.7, some more or less required overrides for the foo-db DB Cluster Parameter Group are


Parameter
override
default.aurora-mysql5.7


binlog_format
MIXED
OFF


character_set_client
utf8mb4


character_set_connection
utf8mb4


character_set_database
utf8mb4


character_set_filesystem
binary


character_set_results
utf8mb4


character_set_server
utf8mb4


collation_connection
utf8mb4_unicode_ci


collation_server
utf8mb4_unicode_ci


server_audit_events
CONNECT,QUERY_DCL,QUERY_DDL,TABLE


server_audit_logging
1
0


log_bin_trust_function_creators
1


server_audit_logs_upload
1
0


The foo-db Parameter Group does not need any overrides- but a dedicated foo-db Parameter Group for the exclusive use of the foo-db is still needed for reasons already mentioned.
If you do not create dedicated parameter groups for exclusive use by your db before creating the db, and the db is created using the default parameter groups, you CAN create the desired parameter groups and modify your db to use them even after the db has been created, but it requires a hard restart of the db and the changes in charset, collation etc will not take effect for already created schemas- the defaults for already created schemas will be whatever they were at the time the schemas were created- so you will have to manually go through each created schema and change them per schema, possibly also by table and column, as required.
You do not want to do this- it’s always better to create the dedicated parameter groups for exclusive use by your db first and create the db pointing to those parameter groups in the first place.
Security

Creds

Many dbs have common and widely known default creds for the root user, eg "root" or "master" or "admin". Optimally, use a different one when creating the db, and optimally, store it securely and never expose it.
Obviously, also choose a strong password, and obviously, store it securely and never expose it. The AWS reference documentation for creating an RDS using CloudFormation (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/quickref-rds.html) contains this unsafe snippet, which leaves the creds (including the password!) in plain text in the CloudFormation file.
"MyDB" : {
"Type" : "AWS::RDS::DBInstance",
"Properties" : {
"DBSecurityGroups" : [
{"Ref" : "MyDbSecurityByEC2SecurityGroup"}, {"Ref" : "MyDbSecurityByCIDRIPGroup"} ],
"AllocatedStorage" : "5",
"DBInstanceClass" : "db.t2.small",
"Engine" : "MySQL",
"MasterUsername" : "MyName",
"MasterUserPassword" : "MyPassword"
},
"DeletionPolicy" : "Snapshot"
}

RDS’s should not be created using this method or any method that ever exposes creds for the root user in plain text anywhere. Even if the creds are immediately changed, there’s still a window of time where an attacker could log in using the plain text creds and create another user which can still be used to access the db even after the root user creds are changed.
There are other problems with that snippet as well, eg: in absence of the a specified version of MySQL, AWS will default to MySQL 5.6 (which is very old and is no longer maintained), also, it doesn't use dedicated parameter groups.
Last but not least, while AWS use the term "master", that doesn't mean we have to - when referring to the "master" creds, prefer using terms like "admin" or "root" instead in order to avoid needlessly awkward situations in a diverse work environment.
IAM Authentication

The root creds should not be used by your service to connect to the db.
The old fashioned way to wire your service to the db is by logging in to the db using the root user and root user password, creating a dedicated service user for the service and assign it the required (but not more, eg no create or drop table permissions) permissions on the relevant schema, and then adding the service user username and password (via Vault, AWS Secrets Manager or something similar) to the service property files.
The new way to wire up your service to the db is by using IAM, which works much the same but doesn’t require storing the password for the service user anywhere.
Database design

These are just some general pointers, I'm sure more could be thought of!
Don’t assume eg the default VARCHAR length of 255 characters will accommodate the range of possible values for a text column – check your domain and tweak the column types accordingly.
Don’t use non-human readable columns (including varbinary for id columns) unless there’s good reason. Trivial performance benefits are not good reason!
Don’t use auto-incremented, sequential guessable id’s. (at least not if they’re going to be exposed – if they’re merely a db-internal identifier that never gets exposed, it’s not as problematic from a security point of view)
Don’t use prefixes like table_
 or tbl_<table_name> – it’s a table, no need to add a redundant, unidiomatic prefix.
Don’t use abbreviations like usr_ppt_mrkt_mgr – figure out proper, idiomatic names for tables. If table names feel like they have to be abbreviated because they’re long combinations of words, that’s a smell that the table probably does too much and can and should be split into individual, atomic, dedicated tables.
Don’t turn tables into a hodgepodge of non-essential columns that don’t actually constitute the core thing the table represents. (especially if a column is nullable, that’s a strong indication it doesn’t really belong in the table)
Eg, looking at this user table, there’s a strong sensation that the best_friend_user_id and favorite_band don’t really have anything to do with what actually constitutes a user
mysql> select * from user;

| user_id | username | best_friend_user_id | favorite_band |
| xxx | foo | null | null |
| yyy | bar | zzz | null |
| zzz | smurf | yyy | The Smurfs |
...
instead, create individual, atomic, dedicated tables for each thing
mysql> select * from best_friends;

| first_user | second_user |
| xxx | xxx |
...
mysql> select * from favorite_bands;

| user_id | favorite_band |
| xxx | The Smurfs |
...
This will make your code types cleaner as well – instead of
data class (
val userId: String,
val username: String,
val bestFriend: String?
)
the user will now only contain what actually constitutes a user (and they will be non nullable, ie guaranteed to always be present as well!)
data class (
val userId: String,
val username: String
)
and you can get the potentially absent best friend and favorite band from dedicated functions
fun findBestFriend(val user: User) : User?
fun findFavoriteBand(val user: User) : String?
that looks in the appropriate tables.
Database migrations and state
Liquibase vs Flyway
There are different tools for managing database migrations and state, and which one you use and how is to some degree a matter of taste. That said, I want to make the case for using Flyway over Liquibase.
Here’s what Flyway migrations look like
$ tree
.
└── migration
├── V1__bootstrap_tables.sql
├── V2__role.sql
├── V3__role_trigger.sql
├── V4__role_insert.sql
├── V5__grant_and_other_tables.sql
├── V6__grant_and_other_triggers.sql
├── V7__insert_types.sql
└── V8__insert_more_member_types.sql
literally just a simple sequence of scripts that make it immediately obvious what changes will be applied in what order.
Also, Flyway runs all scripts on app boot. This has huge benefits:


if any of the scripts fail, the app doesn’t boot and you’ll know immediately. This makes it more or less impossible to end up with non-working changes (including due to incorrect sequence of execution) on the main branch. (because your feature branch won’t build!) If, despite that, non-working changes still made it in to the main branch somehow, non-breaking changes won’t be deployed, because again, the app canary won’t boot in neither test nor prod.


it also enables you to know the state of the scripts – any script that has made it to a given environment will have run in that environment! Simple.


it enables you to combine code changes that depend on a db migration together with their corresponding db migration in a single cohesive change. (vs having to do the db changes first in one change and then do the code changes that depend on the db migration in a subsequent change)

Note also: no folder structure, metafiles, manual script execution etc etc necessary, so the possibility of errors to do with any of that is completely eliminated.
Here’s what the Liquibase migrations for this repo looked like before we switched over to Flyway
$ tree
.
└── changelog
├── README.md
├── db.changelog-master.yaml
├── grant
│ ├── changelog.yaml
│ └── v.01
│ ├── 2019-05-02_01_create_grant.sql
│ ├── 2019-05-28_01_Update_index.sql
│ ├── 2019-09-20_01_drop_table.sql
│ └── changelog.yaml
├── portfolio
│ ├── changelog.yaml
│ ├── v.01
│ │ ├── 2019-05-02_01_create_db.sql
│ │ └── changelog.yaml
│ ├── v.02
│ │ ├── 2019-09-26_01_portfolio_idempotent.sql
│ │ └── changelog.yaml
│ └── v.03
│ ├── 2020_11_18_01_portfolio_resource_col.sql
│ └── changelog.yaml
├── ram
│ ├── changelog.yaml
│ └── v.01
│ ├── 2020-11-18_01_create_user_resource.sql
│ ├── 2021-02-17_01_add_deleted_column.sql
│ └── changelog.yaml
├── resourcepolicy
│ ├── changelog.yaml
│ └── v.01
│ ├── 2019-07-23_01_create_schema.sql
│ ├── 2019-07-25_01_add_constraint.sql
│ ├── 2019-07-29_01_add_id.sql
│ ├── 2019-11-09_01_subject_index.sql
│ ├── 2020-09-03_01_nullable_partner.sql
│ └── changelog.yaml
└── role
├── changelog.yaml
└── v.01
├── 2020-12-10_01_create_role.sql
├── 2020-12-10_01_insert_role.sql
├── 2020-12-10_01_role_audit_h2.trg.sql
├── 2020-12-10_01_role_audit_mysql.trg.sql
└── changelog.yaml
Way more complex. While I can see where wanting to organize things into subfolders according to domain or functionality or whatever comes from, it’s questionable whether the added complexity makes it worth it.
It’s common when using Liquibase to defer execution of scripts, ie the scripts are not automatically run during app boot, instead, an endpoint is hit specifying what script to run and when.
Intuitively this sounds great since it enables complete control and customizability of what to run when. Unfortunately it quickly turns into a nightmare because


since the scripts aren’t run during app boot, non-working changes can easily make their way into the main branch and all the way to prod without anyone noticing they don’t work... until they’re manually run – only then will you find out there’s something wrong with the script, and now you have to figure out how to deal with the non-working script.


It means you don’t know the state of the scripts – a script could have made it out into the main branch and all the way to prod but only have run in test, or not at all, or only in prod... Dangerous! In the case of this app, some of the above scripts had run in all environments, some had not been run in any environment, yet others had only ran in test but not prod... Needlessly and extremely confusing.


It makes it impossible to combine code changes that depend on a db migration together with their corresponding db migration in a single cohesive change. If your app boots with code changes that depend on the script having been executed, and the changes haven’t actually been executed, the app will break. So you need to first make just the db change, execute it everywhere, and only then add the code that depends on it. It needlessly splits things that could be a single cohesive change into needlessly complex and confusing chunks.

Note also metafiles etc etc. In general, there’s just a lot more complexity and different ways to very easily cause problems.
I know Liquibase is more "powerful", eg it supports "rollbacks" etc out of the box. But in my opinion, this comparison is a really good example of the power of simplicity – Flyway being less powerful is a feature, not a bug, because its simplicity completely eliminates many sources of easily preventable errors.
As for "rollbacks", there’s really no such thing as a "rollback" when it comes to database schemas. It’s kind of like git – there’s only an ever progressing sequence of changes. A later change can effectively "override" a previous one, but the previous one wasn’t really "undone". In Flyway, if you need to change some previous change, you simply add a new change that puts you in the desired state.
Check out Flyway here: https://github.com/flyway/flyway
Note that it’s pretty easy to migrate to Flyway from Liquibase even if the db has already been created. Just create a V1__bootstrap.sql file with "create table if not exists" statements, triggers etc to reflect the current state of the db and that’s it.
Database testing
H2 vs MariaDB4J
H2 has advantages over connecting to eg a real RDS instance to run tests, including being much faster, less resource intensive, and, well, self contained, as you don’t need an external db.
Unfortunately, with H2 you quickly run into problems due to differences in supported syntax between whatever db dialect you’re actually using and H2. In many cases, that means eg custom code-based triggers have to be written for H2 which are not guaranteed to behave the same as the real MySQL triggers.
So at the end of the day, H2 is really fast and efficient at... giving you a potentially false sense of security your changes will work as expected in real environments.
An alternative middle ground between connecting to eg a real RDS instance to run tests and H2 is MariaDB4J.
Like H2, it’s an embedded, in memory, ephemeral db, used for testing, but unlike H2, it’s actually a "real" db that supports all the same features as whatever you’re actually using in prod, and doesn’t require workarounds like code-based triggers etc. So while it is a bit slower than H2, it gives much better guarantees that your changes will work as expected in real environments.
Check out MariaDB4J here: https://github.com/vorburger/MariaDB4j
Note that it’s pretty easy to migrate to MariaDB4J from H2 even if the project was based on H2. Just pull in the plugin and change the test config files to point to the MariaDB4J url for the db connections and that’s it.
Why is this all so needlessly complicated
I don’t know! Most of us probably think creating RDS’s is an easy, streamlined process that doesn’t require much thought (as mentioned in the intro, I know I did!), but unfortunately, there’s actually quite a few things that need to be kept in mind. It’s unfortunate that dbs don't have better setting by defaults, that docs aren't better, etc etc, but it is what it is and we just have to work around it, paying attention to detail!
query	collation	result
`select * from user where name = 'zoë';`	`utf8mb4_unicode_ci`	both users
`select * from user where name = 'zoe';`	`utf8mb4_unicode_ci`	both users
`select * from user where name = 'Zoë';`	`utf8mb4_unicode_ci`	both users
`select * from user where name = 'Zoe';`	`utf8mb4_unicode_ci`	both users
`select * from user where name = 'zoë';`	`utf8mb4_bin`	no user
`select * from user where name = 'zoe';`	`utf8mb4_bin`	no user
`select * from user where name = 'Zoë';`	`utf8mb4_bin`	only Zoë
`select * from user where name = 'Zoe';`	`utf8mb4_bin`	only Zoe
Parameter	override	default.aurora-mysql5.7
binlog_format	MIXED	OFF
character_set_client	`utf8mb4`
character_set_connection	`utf8mb4`
character_set_database	`utf8mb4`
character_set_filesystem	`binary`
character_set_results	`utf8mb4`
character_set_server	`utf8mb4`
collation_connection	`utf8mb4_unicode_ci`
collation_server	`utf8mb4_unicode_ci`
server_audit_events	CONNECT,QUERY_DCL,QUERY_DDL,TABLE
server_audit_logging	1	0
log_bin_trust_function_creators	1
server_audit_logs_upload	1	0