Skip to content

Instantly share code, notes, and snippets.

@fredbenenson
Last active December 18, 2024 13:32
Show Gist options
  • Save fredbenenson/7bb92718e19138c20591 to your computer and use it in GitHub Desktop.
Save fredbenenson/7bb92718e19138c20591 to your computer and use it in GitHub Desktop.
Kickstarter SQL Style Guide
layout title description tags
default
SQL Style Guide
A guide to writing clean, clear, and consistent SQL.
data
process

Purpose

Maintaining reproducibility and transparency is a core value of Kickstarter's Data team, and a SQL style guide can help us achieve that goal. Additionally, adhering to the basic rules in this style guide will improve our ability to share, maintain, and extend our research when working with SQL.

This document is written as a manual for anyone working on the Data team, but also as a guide for anyone at the company who would like to write clean and clear code that is meant to be shared.

The individual tips in this guide are based on a composite of knowledge we've gleaned from experience and our roles at previous jobs.

NOTE: This style guide is written for use with AWS Redshift/Postgres 8.0.2, but much of it can be applied to any SQL database.

Principles

  • We take a disciplined and practical approach to writing code.
  • We regularly check-in code to Github
  • We believe consistency in style is very important.
  • We demonstrate intent explicitly in code, via clear structure and comments where needed.

Rules

General stuff

  • No tabs. 2 spaces per indent.

  • No trailing whitespace.

  • Always capitalize SQL keywords (e.g., SELECT or AS)

  • Variable names should be underscore separated:

    GOOD: SELECT COUNT(*) AS backers_count

    BAD: SELECT COUNT(*) AS backersCount

  • Comments should go near the top of your query, or at least near the closest SELECT

  • Try to only comment on things that aren't obvious about the query (e.g., why a particular ID is hardcoded, etc.)

  • Don't use single letter variable names be as descriptive as possible given the context:

    GOOD: SELECT ksr.backings AS backings_with_creators

    BAD: SELECT ksr.backings AS b

  • Use Common Table Expressions (CTEs) early and often, and name them well.

  • HAVING isn't supported in Redshift, so use CTEs instead. If you don't know what this means, ask a friendly Data Team member.

SELECT

Align all columns to the first column on their own line:

SELECT
  projects.name,
  users.email,
  projects.country,
  COUNT(backings.id) AS backings_count
FROM ...

SELECT goes on its own line:

SELECT
  name,
  ...

Always rename aggregates and function-wrapped columns:

SELECT
  name,
  SUM(amount) AS sum_amount
FROM ...

Always rename all columns when selecting with table aliases:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.backings AS backings
INNER JOIN ksr.projects AS projects ON ...

Always use AS to rename columns:

GOOD:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
...

BAD:

SELECT
  projects.name project_name,
  COUNT(backings.id) backings_count
...

Long Window functions should be split across multiple lines: one for the PARTITION, ORDER and frame clauses, aligned to the PARTITION keyword. Partition keys should be one-per-line, aligned to the first, with aligned commas. Order (ASC, DESC) should always be explicit. All window functions should be aliased.

SUM(1) OVER (PARTITION BY category_id,
                          year
             ORDER BY pledged DESC
             ROWS UNBOUNDED PRECEDING) AS category_year

FROM

Only one table should be in the FROM. Never use FROM-joins:

GOOD:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON backings.project_id = projects.id
...

BAD:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects, ksr.backings AS backings
WHERE
  backings.project_id = projects.id
...

JOIN

Explicitly use INNER JOIN not just JOIN, making multiple lines of INNER JOINs easier to scan:

GOOD:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON ...
INNER JOIN ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...

BAD:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
JOIN ksr.backings AS backings ON ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...

Additional filters in the INNER JOIN go on new indented lines:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON projects.id = backings.project_id
  AND backings.project_country != 'US'
...

The ON keyword and condition goes on the INNER JOIN line:

SELECT
  projects.name AS project_name,
  COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON projects.id = backings.project_id
...

Begin with INNER JOINs and then list LEFT JOINs, order them semantically, and do not intermingle LEFT JOINs with INNER JOINs unless necessary:

GOOD:

INNER JOIN ksr.backings AS backings ON ...
INNER JOIN ksr.users AS users ON ...
INNER JOIN ksr.locations AS locations ON ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...

BAD:

LEFT JOIN ksr.backer_rewards AS backer_rewards ON backings
INNER JOIN ksr.users AS users ON ...
LEFT JOIN ...
INNER JOIN ksr.locations AS locations ON ...

WHERE

Multiple WHERE clauses should go on different lines and begin with the SQL operator:

SELECT
  name,
  goal
FROM ksr.projects AS projects
WHERE
  country = 'US'
  AND deadline >= '2015-01-01'
...

CASE

CASE statements aren't always easy to format but try to align WHEN, THEN, and ELSE together inside CASE and END:

CASE WHEN category = 'Art'
     THEN backer_id
     ELSE NULL
END

Common Table Expressions (CTEs)

From AWS:

WITH clause subqueries are an efficient way of defining tables that can be used throughout the execution of a single query. In all cases, the same results can be achieved by using subqueries in the main body of the SELECT statement, but WITH clause subqueries may be simpler to write and read.

The body of a CTE must be one indent further than the WITH keyword. Open them at the end of a line and close them on a new line:

WITH backings_per_category AS (
  SELECT
    category_id,
    deadline,
    ...
)

Multiple CTEs should be formatted accordingly:

WITH backings_per_category AS (
  SELECT
    ...
), backers AS (
  SELECT
    ...
), backers_and_creators AS (
  ...
)
SELECT * FROM backers;

If possible, JOIN CTEs inside subsequent CTEs, not in the main clause:

GOOD:

WITH backings_per_category AS (
  SELECT
    ...
), backers AS (
  SELECT
    backings_per_category.backer_id,
    COUNT(backings_per_category.id) AS projects_backed_per_category
  INNER JOIN ksr.users AS users ON users.id = backings_per_category.backer_id
  GROUP BY backings_per_category.backer_id
), backers_and_creators AS (
  ...
)
SELECT * FROM backers_and_creators;

BAD:

WITH backings_per_category AS (
  SELECT
    ...
), backers AS (
  SELECT
    backer_id,
    COUNT(backings_per_category.id) AS projects_backed_per_category
), backers_and_creators AS (
  ...
)
SELECT * FROM backers_and_creators
INNER JOIN backers ON backers_and_creators ON backers.backer_id = backers_and_creators.backer_id

Always use CTEs over inlined subqueries.

Tips

@emrecolako
Copy link

Great guide, one quick comment on this

HAVING isn't supported in Redshift, so use CTEs instead. If you don't know what this means, ask a friendly Data Team member.

HAVING is actually supported in Redshift here

@hannahflaherty
Copy link

hannahflaherty commented Sep 14, 2020

This got linked in a newsletter - happy to see it, I agree with most of it, especially the non-formatting rules.

But I don't think this, from the CTE join rule, is valid SQL:

WITH backings_per_category AS (
  SELECT
    ...
), backers AS (
  SELECT
    backer_id,
    COUNT(backings_per_category.id) AS projects_backed_per_category
  INNER JOIN ksr.users AS users ON users.id = backings_per_category.backer_id
), backers_and_creators AS (
  ...
)
SELECT * FROM backers_and_creators;

The middle CTE, backers, is missing a FROM and a GROUP BY. It should look like this:

...
backers AS (
  SELECT
    backer_id,
    COUNT(backings_per_category.id) AS projects_backed_per_category
  FROM backings_per_category
    INNER JOIN ksr.users AS users ON users.id = backings_per_category.backer_id
  GROUP BY backer_id
)
...

The only other rule I'd add to this list is that any query joining two or more tables or CTEs together must always use table aliases for every single field reference, so a later query reader can always be sure which table a given field comes from. For example, it's not clear whether backer_id in the code above comes from the previous CTE or from the users table.

@fredbenenson
Copy link
Author

Nice catch @hannahflaherty! And yes, that's a good edit too. CTEs are great but can definitely get confusing if you don't explicitly rename them.

Curious what newsletter you saw this in.

@hannahflaherty
Copy link

Curious what newsletter you saw this in.

It showed up on last week's Data Elixir.

@fredbenenson
Copy link
Author

Sweet! Thanks – signed up and made those edits you suggested.

@benjaminkaplanphd
Copy link

Great guide!
I am trying to make the case to my team for always using AS to alias columns and tables. What did you base this decision on? Are there any related blog posts?

@fredbenenson
Copy link
Author

@benjaminkaplanphd Here's the strongest case I can make: it is just more explicit that you're creating an alias. If you don't use AS then it can just read like a random variable name. I think it can be handy as a hint when skimming to remember that a particular table or CTE Is being named AS something different.

@bhtucker
Copy link

To add to that, it clarifies that you didn't just miss a comma and intend to select another field :)

@maximilian22x
Copy link

maximilian22x commented Oct 18, 2021

Why you do not use this CASE format?

CASE WHEN category = 'Art'      THEN backer_id
     WHEN category = 'Whatever' THEN backer_something_else ELSE NULL
 END as column_name

Its seems to me to better understandable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment