from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row
# If you are following along on Databricks, change the path to
# "/FileStore/tables/marketing_campaign.csv".
df = spark.read.load("./marketing_campaign.csv",
                     format="csv",
                     sep="\t",
                     inferSchema="true",
                     header="true")
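# (Optional sanity check, a sketch.) Before selecting columns, you can
# inspect the schema that inferSchema produced and preview a few rows:
df.printSchema()
df.show(5, truncate=False)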
dfSelected = df.select(
    col("ID").alias("id"),
    col("Year_Birth").alias("year_birth"),
    col("Education").alias("education"),
    col("Kidhome").alias("count_kid"),
    col("Teenhome").alias("count_teen"),
    col("Dt_Customer").alias("date_customer"),
    col("Recency").alias("days_last_login")
)
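# As an aside, individual columns can also be renamed with
# withColumnRenamed. A sketch (dfRenamed is a hypothetical name for
# illustration); note that select + alias above additionally drops the
# columns we do not keep, while withColumnRenamed preserves all of them:
dfRenamed = (df
    .withColumnRenamed("ID", "id")
    .withColumnRenamed("Recency", "days_last_login"))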
# Parse date_customer (stored as "d-M-yyyy" text) and add 72 months to derive date_joined.
dfConverted = dfSelected.withColumn("date_joined",
    add_months(to_date(col("date_customer"), "d-M-yyyy"), 72))
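# A minimal sketch of what the "d-M-yyyy" pattern does, using a
# hypothetical sample value (and the Row import from above):
# to_date parses "4-9-2014" as 2014-09-04, and add_months shifts it
# forward 72 months (6 years) to 2020-09-04.
sample = spark.createDataFrame([Row(date_customer="4-9-2014")])
sample.select(
    to_date(col("date_customer"), "d-M-yyyy").alias("parsed"),
    add_months(to_date(col("date_customer"), "d-M-yyyy"), 72).alias("plus_72_months")
).show()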
# Repartition the data into 5 partitions so Spark processes them in parallel.
dfPartitioned = dfConverted.repartition(5)
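# To confirm the repartitioning took effect, check the number of
# underlying partitions (this should print 5). Note that repartition()
# performs a full shuffle; coalesce() can reduce partitions without one.
print(dfPartitioned.rdd.getNumPartitions())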