Skip to content

Instantly share code, notes, and snippets.

@joaopcnogueira
Last active November 3, 2023 02:35
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save joaopcnogueira/de17e6d9c54c8e9a0827bca7c45967c2 to your computer and use it in GitHub Desktop.
Save joaopcnogueira/de17e6d9c54c8e9a0827bca7c45967c2 to your computer and use it in GitHub Desktop.
Piece of code demonstrating how to build regressive features for machine learning modeling, such as the sum of sales for the last 3 months, the mean of sales for the last 6 months and so on.
# Creating a spine table with three columns ----
# customer_id: identifier of the customer, for which we are going to predict the next month sales
# year_month: reference date
# sales: the metric we want to predict
spine_tbl <- tibble(
customer_id = c(rep("João", 24), rep("Denise", 24)),
year_month = c( seq( ymd("2021-11-01"), ymd("2023-10-01"), by = '1 month' ), seq( ymd("2021-11-01"), ymd("2023-10-01"), by = '1 month' ) ),
sales = sample(100:1000, 48, replace = TRUE)
)
spine_tbl %>%
# Arrange to properly build the rolling and aggregated features
arrange(customer_id, year_month) %>%
group_by(customer_id) %>%
# We use the “lag” function because we can get data up to the current reference
# date (year_month) and NOT including the current year_month sales. That is because
# in theory, we do not know the sales of the current year_month.
mutate(
sales1m = lag(sales),
sales2m = lag(sales, 2),
) %>%
na.omit() %>%
# Building the rolling and aggregated features
mutate(
sum_sales = cumsum(sales1m),
sum_sales3m = zoo::rollapply(sales1m, 3, sum, partial = TRUE, align = "right"),
sum_sales6m = zoo::rollapply(sales1m, 6, sum, partial = TRUE, align = "right"),
mean_sales = cummean(sales1m),
mean_sales3m = zoo::rollapply(sales1m, 3, mean, partial = TRUE, align = "right"),
mean_sales6m = zoo::rollapply(sales1m, 6, mean, partial = TRUE, align = "right"),
)
# Output ----
# A tibble: 44 × 11
# Groups: customer_id [2]
# customer_id year_month sales sales1m sales2m sum_sales sum_sales3m sum_sales6m mean_sales mean_sales3m mean_sales6m
# <chr> <date> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 Denise 2022-01-01 493 335 312 335 335 335 335 335 335
# 2 Denise 2022-02-01 726 493 335 828 828 828 414 414 414
# 3 Denise 2022-03-01 658 726 493 1554 1554 1554 518 518 518
# 4 Denise 2022-04-01 139 658 726 2212 1877 2212 553 626. 553
# 5 Denise 2022-05-01 626 139 658 2351 1523 2351 470. 508. 470.
# 6 Denise 2022-06-01 674 626 139 2977 1423 2977 496. 474. 496.
# 7 Denise 2022-07-01 524 674 626 3651 1439 3316 522. 480. 553.
# 8 Denise 2022-08-01 784 524 674 4175 1824 3347 522. 608 558.
# 9 Denise 2022-09-01 604 784 524 4959 1982 3405 551 661. 568.
# 10 Denise 2022-10-01 944 604 784 5563 1912 3351 556. 637. 558.
# ℹ 34 more rows
# So, for each customer, we can fit a machine learning algorithm to predict the sales
# for a specific year_month based on the following builded features:
# sales1m: last sales month
# sales2m: sales for the past 2 month
# sum_sales: cumulative sum, a.k.a total sales
# sum_sales3m: sum of the last 3 months
# sum_sales6m: sum of the last 6 months
# mean_sales: cumulative mean of sales
# mean_sales3m: mean sales of the last 3 months
# mean_sales6m: mean sales of the last 6 months
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment