Skip to content

Instantly share code, notes, and snippets.

@reosarevok
Last active November 25, 2021 12:52
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save reosarevok/2d8a2fa6310cc0144cad043a9c566717 to your computer and use it in GitHub Desktop.
Save reosarevok/2d8a2fa6310cc0144cad043a9c566717 to your computer and use it in GitHub Desktop.

SQL style guide

This is the style guide for SQL queries for the MusicBrainz Server project. It is based on https://github.com/meadmaker/sql-style-guide

Formatting

Keywords

  • Keywords (AS, FROM, SELECT, WHERE, etc) should be UPPERCASED, but...

  • Function names (array_agg, count, json_build_object) should be lowercased, even if some of them are officially keywords.

    /* Good */
    SELECT count(*) FROM url WHERE url LIKE '%wikidata%';
    
    /* Bad */
    select count(*) from url where url like '%wikidata%';

Names

  • Names for tables and columns (including aliases) should always be lowercased and contain only ASCII letters and underscores.

Indentation and newlines

  • Newlines should be used for any query that is at all complex or longer than 78 characters.

  • Each clause should begin a new line. SELECT, JOIN, LEFT JOIN, OUTER JOIN, WHERE, UNION, etc. are keywords that begin new clauses.

    /* Good */
    SELECT *
      FROM recording_first_release_date
     WHERE recording = ANY(?);
    
    /* Bad */
    SELECT * FROM recording_first_release_date
    WHERE recording = ANY(?);
  • The keywords that begin a clause should be right-aligned. The idea is to make a single character column between the keywords and their objects.

    /* Good */
    SELECT *
      FROM recording_first_release_date
     WHERE recording = ANY(?);
    
    /* Bad */
    SELECT *
    FROM recording_first_release_date
    WHERE recording = ANY(?);
  • Each column in a SELECT statement should be in a new line.

    /* Good */
    SELECT id,
           artist_credit,
           name
      FROM release;
    
    /* Bad */
    SELECT id, artist_credit, name
      FROM release;
  • Subqueries and other parentheses-enclosed blocks should be aligned as though the open parenthesis were the 0-column. So, they should be indented as a unit, to identify them as a block. Subqueries should continue to have the opening keywords right-aligned.

    /* Good */
    SELECT DISTINCT release.id AS release_id,
                    release.name,
                    row_number() OVER (ORDER BY release.name COLLATE musicbrainz)
               FROM (  SELECT medium.release,
                              min(medium.position) AS first_medium,
                              max(medium.position) AS last_medium,
                              count(medium.position) AS medium_count,
                              sum(medium.position) AS medium_pos_acc
                         FROM medium
                     GROUP BY medium.release) medium_sequence_data
               JOIN release ON release.id = medium_sequence_data.release
              WHERE first_medium != 1
                 OR last_medium != medium_count
                 OR (medium_count * (1 + medium_count)) / 2 <> medium_pos_acc;
    
    /* Bad */
    SELECT DISTINCT release.id AS release_id,
                    release.name,
                    row_number() OVER (ORDER BY release.name COLLATE musicbrainz)
               FROM (SELECT medium.release,
                            min(medium.position) AS first_medium,
                            max(medium.position) AS last_medium,
                            count(medium.position) AS medium_count,
                            sum(medium.position) AS medium_pos_acc
                   FROM medium
                   GROUP BY medium.release) medium_sequence_data
               JOIN release ON release.id = medium_sequence_data.release
              WHERE first_medium != 1
                 OR last_medium != medium_count
                 OR (medium_count * (1 + medium_count)) / 2 <> medium_pos_acc;
    /* Good */
    INSERT INTO release (id, gid, name, release_group, artist_credit)
         VALUES (1,
                 '7b906020-72db-11de-8a39-0800200c9a70',
                 'Release Group',
                 1,
                 1),
                (2,
                 '7c906020-72db-11de-8a39-0800200c9a71',
                 'Release Group',
                 1,
                 1);
    
    /* Bad */
    INSERT INTO release (id, gid, name, release_group, artist_credit)
         VALUES (1, '7b906020-72db-11de-8a39-0800200c9a70', 'Release Group', 1, 1),
           (2, '7c906020-72db-11de-8a39-0800200c9a71', 'Release Group', 1, 1);

Structure

  • Column aliases should always use the keyword AS This becomes significant when a query has several columns selected with columns aliased. Without the AS keyword, a dropped comma makes two columns become a single aliased column.

    /* Good */
    SELECT DISTINCT release.id AS release_id,
                    release.name,
                    row_number() OVER (ORDER BY release.name COLLATE musicbrainz)
               FROM release;
    
    /* Bad */
    SELECT DISTINCT release.id release_id,
                    release.name,
                    row_number() OVER (ORDER BY release.name COLLATE musicbrainz)
               FROM release;
    
  • Table aliases and column aliases should be descriptive. Much like variable names, "a", "b", "x", etc are not generally useful in and of themselves outside of short examples.

  • Tiny names for table aliases can sometimes work as abbreviations. As an example, if "releases" is referenced frequently, it might make sense to abbreviate it "r". However, "rel" is almost as short, and much more descriptive. Have a good reason for "r" instead of "rel".

  • Subquery aliases should be even more descriptive. Subqueries effectively create ad-hoc tables in memory. As such, if you name it "x", then there's absolutely nothing to suggest the intention behind the table to a later maintainer.

    /* Good */
    SELECT *
      FROM (  SELECT medium.release,
               [...]
            GROUP BY medium.release
      ) medium_sequence_data
      JOIN release ON release.id = medium_sequence_data.release;
    
    /* Bad */
    SELECT *
      FROM (  SELECT medium.release,
               [...]
            GROUP BY medium.release
      ) s
      JOIN release ON release.id = s.release;
    • Abbreviations don't help in subquery aliases

Queries inside Perl code

  • Newlines should be used for any query inside Perl code which would otherwise make its line longer than 78 characters.

  • When using newlines inside Perl code, prefer heredocs to multiline strings.

    /* Good */
    $c->sql->do(<<~'SQL');
        INSERT INTO artist (id, gid, name, sort_name)
             VALUES (1, 'a9d99e40-72d7-11de-8a39-0800200c9a66', 'Name', 'Name')
        SQL
    
    /* Bad */
    $c->sql->do(q{INSERT INTO artist (id, gid, name, sort_name)
                     VALUES (1, 'a9d99e40-72d7-11de-8a39-0800200c9a66', 'Name', 'Name')});
  • For long multiline queries that make use of Perl variables, prefer interpolation to appending strings, since it allows using heredoc blocks and usually improves readability.

    /* Good */
    my $tag_table = $self->tag_table;
    my $tag_table_entity_column = $self->type;
    my $query = <<~"SQL";
           SELECT tag.name,
                  entity_tag.count,
                  tag.id AS tag_id,
                  genre.id AS genre_id
             FROM $tag_table entity_tag
             JOIN tag ON tag.id = entity_tag.tag
        LEFT JOIN genre ON tag.name = genre.name
            WHERE $tag_table_entity_column = ?
         ORDER BY entity_tag.count DESC,
                  tag.name COLLATE musicbrainz
        SQL
    
    /* Bad */
    my $query = '   SELECT tag.name,
                           entity_tag.count,
                           tag.id AS tag_id,
                           genre.id AS genre_id
                      FROM ' . $self->tag_table . ' entity_tag
                      JOIN tag ON tag.id = entity_tag.tag
                 LEFT JOIN genre ON tag.name = genre.name
                     WHERE ' . $self->type . ' = ?
                  ORDER BY entity_tag.count DESC,
                           tag.name COLLATE musicbrainz';
@mwiencek
Copy link

I'd suggest function names should always be lowercase, even if they're considered keywords (e.g. min, max, count, avg, array_agg, etc. -- ones part of the SQL standard). This seems less confusing, because there are a lot of built-in postgres functions that aren't part of the SQL standard, and it would feel arbitrary to uppercase e.g. ARRAY_AGG but not JSON_BUILD_OBJECT. It also feels more consistent with user-defined functions, which are already always lowercased in our code.

@reosarevok
Copy link
Author

Sounds good, updated the Keywords section :)

One major issue I've found is what to do with stuff like several nested subqueries inside Perl. In theory, the current draft would call for them to be nested with indentation starting from the opening parenthesis, but that very quickly pushes them to fairly ridiculous lengths. One option would be to define subqueries as interpolated constants in those cases, another to change the way we indent them: possibly to have the keywords start where the values of the previous indent start, something like:

Another issue I've seen is stuff such as very long JOIN lines - if we decide to break a line such as JOIN very_long_table_name_here vltnh ON whatever_else.vltnh = vltnh.id AND vltnh.id = any(?);, how should that be indented?

Yet a third question I'd have is what to do with INSERT INTO ... VALUES when there's a lot of columns. Should we always break each column into its own line if they don't fit in one, or would it be ok to do something like

    INSERT INTO editor (
                  id, name, password, privs,
                  email, website, bio,
                  member_since, email_confirm_date, last_login_date,
                  ha1
                )
         VALUES (
                  5, 'adminymcadmin', '{CLEARTEXT}password', 128,
                  'test@email.com', 'http://test.website', 'biography',
                  '1989-07-23', '2005-10-20', '2013-04-05',
                  'aa550c5b01407ef1f3f0d16daf9ec3c8'
                );

with the caveat that the column names and values should match for each line we divide it into?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment