Purpose: This notebook is a style contract for AI-assisted code generation and review. When referenced from another notebook (e.g.,
refer to @pyspark_style_guide), you MUST apply every rule below to all PySpark code produced or reviewed in that session.Adapted from Palantir PySpark Style Guide (MIT License).
Use features and API supported by following versions:
- Spark 3.5
- Delta 3.2
- Python 3.11
When reviewing or generating PySpark code, walk through each check below in order. Flag every violation found. Do not skip checks.
| # | Check | What to look for |
|---|---|---|
| C1 | Imports | Any bare from pyspark.sql.functions import ... or alias other than F, T, W |
| C2 | Column access | Any df.colName dot-access outside of a join on= clause |
| C3 | String column refs | Any F.col('x') that could just be 'x' (Spark 3.0+) |
| C4 | Variable names | Any single-letter dataframe names (df, o, d, t) |
| C5 | Magic values | Any literal string, number, or threshold inline in filter, when, withColumn, select that is not a named constant |
| C6 | Select contract | More than one function per column in a select, or a .when() expression inside a select |
| C7 | withColumnRenamed | Any use. Replace with select + .alias() |
| C8 | Empty columns | Any lit(''), lit('NA'), lit('N/A'). Must be lit(None) |
| C9 | Logical density | More than 3 boolean expressions in a single .filter() or F.when() without named variables |
| C10 | Chain length | More than 5 chained statements in one block |
| C11 | Chain mixing | Joins, filters, withColumn, and selects mixed in the same chain |
| C12 | Join hygiene | Any .join() missing explicit how= |
| C13 | Right joins | Any how='right'. Swap df order, use left |
| C14 | Window frames | Any Window.partitionBy(...).orderBy(...) without explicit .rowsBetween() or .rangeBetween() |
| C15 | Window nulls | F.first() or F.last() without ignorenulls=True |
| C16 | Global windows | Empty W.partitionBy() or window without orderBy used for aggregation. Use .agg() instead |
| C17 | Otherwise fallback | .otherwise(<catch-all value>) masking unexpected data. Use None or omit |
| C18 | Line continuation | Any \ for multiline. Wrap in parentheses instead |
| C19 | UDFs | Any @udf or F.udf(). Rewrite with native functions |
| C20 | Comments | Comments that describe what code does instead of why a decision was made |
| C21 | Dead code | Commented-out code blocks. Remove them |
| C22 | Function size | Functions over ~70 lines or files over ~250 lines |
Each pattern below is a regex-like signature. If you see it, it is a violation.
# VIOLATION: any of these
from pyspark.sql.functions import col, when, sum, lit
import pyspark.sql.functions as func
# FIX: always
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W# VIOLATION: df.column_name anywhere except join on=
df.select(df.order_id, df.amount)
df.withColumn('x', df.price * df.qty)
# FIX: use string refs
df.select('order_id', 'amount')
df.withColumn('x', F.col('price') * F.col('qty'))# VIOLATION: bare literals in logic
df.filter(F.col('amount') > 500)
F.when(F.col('status') == 'shipped', 'In Transit')
df.filter(F.col('days') < 365)
# FIX: named constants at top of cell/function
HIGH_VALUE_THRESHOLD = 500
STATUS_SHIPPED = 'shipped'
LABEL_IN_TRANSIT = 'In Transit'
ONE_YEAR_DAYS = 365
df.filter(F.col('amount') > HIGH_VALUE_THRESHOLD)
F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
df.filter(F.col('days') < ONE_YEAR_DAYS)# VIOLATION: more than 3 conditions inline
df.filter(
(F.col('a') == 'x') & (F.col('b') > 10) & (F.col('c') != 'y')
& ((F.col('d') == 'online') | (F.col('d') == 'partner'))
)
# FIX: named boolean expressions, max 3 in the final filter
is_valid_type = (F.col('a') == TYPE_X)
above_threshold = (F.col('b') > MIN_THRESHOLD)
not_excluded = (F.col('c') != EXCLUDED_STATUS)
is_target_channel = (F.col('d') == CHANNEL_ONLINE) | (F.col('d') == CHANNEL_PARTNER)
flagged = is_valid_type & above_threshold & not_excluded & is_target_channel
df.filter(flagged)# VIOLATION: conditional logic embedded in select
df.select(
'order_id',
F.when(F.col('status') == 'shipped', 'In Transit')
.when(F.col('status') == 'delivered', 'Complete')
.alias('status_label'),
)
# FIX: select plain columns, then withColumn for derived logic
df = df.select('order_id', 'status')
df = df.withColumn(
'status_label',
F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
.when(F.col('status') == STATUS_DELIVERED, LABEL_COMPLETE)
)# VIOLATION
df.withColumn('notes', F.lit(''))
df.withColumn('review_date', F.lit('N/A'))
# FIX
df.withColumn('notes', F.lit(None))
df.withColumn('review_date', F.lit(None))# VIOLATION: implicit frame
w = W.partitionBy('customer_id').orderBy('order_date')
# FIX: always explicit
w = (W.partitionBy('customer_id')
.orderBy('order_date')
.rowsBetween(W.unboundedPreceding, 0))# VIOLATION: masks unexpected values
F.when(..., 'A').when(..., 'B').otherwise('Unknown')
# FIX: omit otherwise (returns null) or use lit(None) explicitly
F.when(..., 'A').when(..., 'B')# VIOLATION: mixed concerns, too long
df = (df.select(...).filter(...).withColumn(...).join(...).drop(...).withColumn(...))
# FIX: separate by concern, max 5 per block
df = (
df
.select(...)
.filter(...)
)
df = df.withColumn(...)
df = df.join(..., how='inner')# VIOLATION
df = df.filter(F.col('a') == 'x') \
.filter(F.col('b') > 10)
# FIX: parentheses
df = (
df
.filter(F.col('a') == 'x')
.filter(F.col('b') > 10)
)When writing new code, apply these defaults:
- Imports:
F,T,Wonly - Columns: string refs where possible,
F.col()when needed - Descriptive df names:
orders_df,active_orders, notdf,o - Constants: every literal in logic gets a
SCREAMING_SNAKEname - Selects: plain columns + one transform each, no
.when()inside - Chains: max 5 lines, group by concern (filter/select, then enrich, then join)
- Joins: always
how=, alwaysleftnotright, alias for disambiguation - Windows: always explicit frame, always
ignorenulls=Trueonfirst/last - Empty cols:
F.lit(None), neverlit('')orlit('NA') - No UDFs, no
.otherwise()fallbacks, no\continuations - Comments explain why, not what. No commented-out code.