layout | title | description | tags | ||
---|---|---|---|---|---|
default |
SQL Style Guide |
A guide to writing clean, clear, and consistent SQL. |
|
Maintaining reproducibility and transparency are core values to any data scientist, and a SQL style guide can help us achieve that goal. Additionally, adhering to the basic rules in this style guide will improve our ability to share, maintain, and extend our research when working with SQL.
This document is written as a manual for anyone working on the Data team, but also as a guide for anyone at the company who would like to write clean and clear code that is meant to be shared.
The individual tips in this guide are based on a composite of knowledge we've gleaned from experience and our roles at previous jobs.
NOTE: This style guide is written for use with AWS Redshift/Postgres 8.0.2, but much of it can be applied to any SQL database.
This guide is adapted from the Kickstarter SQL Style Guide.
- We take a disciplined and practical approach to writing code.
- We regularly check-in code to Github
- We believe consistency in style is very important.
- We demonstrate intent explicitly in code, via clear structure and comments where needed.
-
No tabs. 2 spaces per indent.
-
No trailing whitespace.
-
Always capitalize SQL keywords (e.g.,
SELECT
orAS
) -
Variable names should be underscore separated:
GOOD:
SELECT COUNT(*) AS backers_count
BAD:
SELECT COUNT(*) AS backersCount
-
Comments should go near the top of your query, or at least near the closest
SELECT
-
Try to only comment on things that aren't obvious about the query (e.g., why a particular ID is hardcoded, etc.)
-
Don't use single letter variable names be as descriptive as possible given the context:
GOOD:
SELECT ksr.backings AS backings_with_creators
BAD:
SELECT ksr.backings AS b
-
Use Common Table Expressions (CTEs) early and often, and name them well.
Align all columns to the first column on their own line:
SELECT
projects.name,
users.email,
projects.country,
COUNT(backings.id) AS backings_count
FROM ...
SELECT
goes on its own line:
SELECT
name,
...
Always rename aggregates and function-wrapped columns:
SELECT
name,
SUM(amount) AS sum_amount
FROM ...
Always rename all columns when selecting with table aliases:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.backings AS backings
INNER JOIN ksr.projects AS projects ON ...
Always use AS
to rename columns:
GOOD:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
...
BAD:
SELECT
projects.name project_name,
COUNT(backings.id) backings_count
...
Long Window functions should be split across multiple lines: one for the PARTITION
, ORDER
and frame clauses, aligned to the PARTITION
keyword. Partition keys should be one-per-line, aligned to the first, with aligned commas. Order (ASC
, DESC
) should always be explicit. All window functions should be aliased.
SUM(1) OVER (PARTITION BY category_id,
year
ORDER BY pledged DESC
ROWS UNBOUNDED PRECEDING) AS category_year
Only one table should be in the FROM
. Never use FROM
-joins:
GOOD:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON backings.project_id = projects.id
...
BAD:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects, ksr.backings AS backings
WHERE
backings.project_id = projects.id
...
Explicitly use INNER JOIN
not just JOIN
, making multiple lines of INNER JOIN
s easier to scan:
GOOD:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON ...
INNER JOIN ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...
BAD:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
JOIN ksr.backings AS backings ON ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...
Additional filters in the INNER JOIN
go on new indented lines:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON projects.id = backings.project_id
AND backings.project_country != 'US'
...
The ON
keyword and condition goes on the INNER JOIN
line:
SELECT
projects.name AS project_name,
COUNT(backings.id) AS backings_count
FROM ksr.projects AS projects
INNER JOIN ksr.backings AS backings ON projects.id = backings.project_id
...
Begin with INNER JOIN
s and then list LEFT JOIN
s, order them semantically, and do not intermingle LEFT JOIN
s with INNER JOIN
s unless necessary:
GOOD:
INNER JOIN ksr.backings AS backings ON ...
INNER JOIN ksr.users AS users ON ...
INNER JOIN ksr.locations AS locations ON ...
LEFT JOIN ksr.backer_rewards AS backer_rewards ON ...
LEFT JOIN ...
BAD:
LEFT JOIN ksr.backer_rewards AS backer_rewards ON backings
INNER JOIN ksr.users AS users ON ...
LEFT JOIN ...
INNER JOIN ksr.locations AS locations ON ...
Multiple WHERE
clauses should go on different lines and begin with the SQL operator:
SELECT
name,
goal
FROM ksr.projects AS projects
WHERE
country = 'US'
AND deadline >= '2015-01-01'
...
CASE
statements aren't always easy to format but try to align WHEN
, THEN
, and ELSE
together inside CASE
and END
:
CASE WHEN category = 'Art'
THEN backer_id
ELSE NULL
END
For nested CASE
statements, the nested CASE
statement should be moved to a new line and indented after the THEN
of the outer statement:
CASE
WHEN category = 'Art' THEN
CASE
WHEN subcategory = 'Fine Art' THEN fine_art_id
WHEN subcategory = 'Modernist' THEN modernist_id
ELSE backer_id
END
ELSE other_id
END
For multiple conditions within the WHEN
statement, use the following syntax:
CASE
WHEN
lead_tracking_division_name = 'Email'
AND lead_tracking_network_id != ''
AND LENGTH(${lead_tracking_network_id}) >= 7
THEN RIGHT(${lead_tracking_network_id}, 7)
WHEN lead_tracking_division_name = 'SMS'
AND lead_tracking_network_id != ''
AND LENGTH(${lead_tracking_network_id}) >= 6
THEN RIGHT(${lead_tracking_network_id}, 6)
ELSE NULL
END
Do not CAST
inside of a JOIN
expression
GOOD:
SELECT
CAST(lead.id as STRING)
...
FROM salesforce.lead AS lead
LEFT JOIN marketo.lead as marketo_lead ON marketo_lead.lead_id=lead.id
BAD:
SELECT
lead.id,
...
FROM salesforce.lead AS lead
LEFT JOIN marketo.lead as marketo_lead ON marketo_lead.lead_id=CAST(lead.id AS STRING)
Only use IN
when you explicitly enumerate the allowed values, do not use dynamic IN
statements.
GOOD:
SELECT
id,
company_name,
city,
country
FROM supplier
WHERE country IN ('USA', 'UK', 'Japan')
BAD:
SELECT
id,
company_name,
city,
country
FROM supplier
WHERE country IN (SELECT * FROM COUNTRIES)
When possible to evaluate a Boolean without a CASE
, do so:
GOOD:
SELECT
(country='US') AS domestic
BAD:
SELECT
(CASE WHEN country='US' THEN 1 ELSE 0 END) AS domestic
Booleans should always be true Booleans, not integer or string representations:
GOOD:
SELECT
(country='US') AS domestic
BAD:
SELECT
(CASE WHEN country='US' THEN TRUE ELSE FALSE END) AS domestic
Aggregates should coalesce to 0 (as opposed to NULL):
GOOD:
SELECT
MIN(col) || MIN(0) as asserted_col
WITH
clause subqueries are an efficient way of defining tables that can be used throughout the execution of a single query. In all cases, the same results can be achieved by using subqueries in the main body of theSELECT
statement, butWITH
clause subqueries may be simpler to write and read.
The body of a CTE must be one indent further than the WITH
keyword. Open them at the end of a line and close them on a new line:
WITH backings_per_category AS (
SELECT
category_id,
deadline,
...
)
Multiple CTEs should be formatted accordingly:
WITH backings_per_category AS (
SELECT
...
), backers AS (
SELECT
...
), backers_and_creators AS (
...
)
SELECT * FROM backers;
If possible, JOIN
CTEs inside subsequent CTEs, not in the main clause:
GOOD:
WITH backings_per_category AS (
SELECT
...
), backers AS (
SELECT
backer_id,
COUNT(backings_per_category.id) AS projects_backed_per_category
INNER JOIN ksr.users AS users ON users.id = backings_per_category.backer_id
), backers_and_creators AS (
...
)
SELECT * FROM backers_and_creators;
BAD:
WITH backings_per_category AS (
SELECT
...
), backers AS (
SELECT
backer_id,
COUNT(backings_per_category.id) AS projects_backed_per_category
), backers_and_creators AS (
...
)
SELECT * FROM backers_and_creators
INNER JOIN backers ON backers_and_creators ON backers.backer_id = backers_and_creators.backer_id
Always use CTEs over inlined subqueries.
- Prefer
UNION ALL
overUNION
(background) unless you really want to have duplicates in your unioned table. - Hesitate strongly before you use a
LIMIT
and NEVER use anORDER BY
. These clauses wreak havoc on distributed data warehouses. If you need to get the first/last values specifically, look to do that with a min/max instead ofLIMIT
ing.