---
title: "Jane_Street_Kaggle"
author: "Jasen Mackie"
date: "18/12/2020"
output: html_document
---
```{r setup, include=FALSE}
require(data.table)
require(ggplot2)
options(scipen=999)
knitr::opts_chunk$set(echo = TRUE, fig.align="center")
```
# Introduction
I have been meaning to compete in a Kaggle competition for some time. The [Jane Street Market Prediction competition](https://www.kaggle.com/c/jane-street-market-prediction) seemed like a good place to start considering my experience and interests. It was not until [Jonathan Larkin's post](https://twitter.com/jonathanrlarkin/status/1337062964562698241?s=20), however, that I got real inspiration to start looking at the data more closely. Jonathan kindly shares a Kaggle notebook whose main objective is to introduce the user to PyTorch Lightning for applying a _Multi-Target_ approach using Deep Learning. The notebook uses a Python kernel. I am more of an R guy, and since _torch_ was [recently made available in R](https://blog.rstudio.com/2020/09/29/torch/) I figured a _partial_ replication of Jonathan's notebook would be a good place to start.
Before we get into it, a few notes about what this post is not:
- This is not a 100% complete replication of Jonathan's kindly shared notebook. I also add some additional notes.
- This is not a submission for the competition.
- This document has many TODOs, which I touch on at the end. Basically, no feature engineering or hyperparameter tuning has been carried out yet. This is left to the reader, including a future me.
Let's dig in...
# Data
The train.csv dataset (at the time of downloading) is a large 6.2GB CSV. We will use data.table, as Jonathan does with its Python counterpart.
Load train and features data
```{r load data, echo=FALSE}
train <- fread("/home/jasen/Personal-Work/Kaggle/jane-street-market-prediction/train.csv")
# Trim the dataset by 75% for dev purposes
# train <- train[1:(floor(nrow(train)*0.25))]
features <- fread("/home/jasen/Personal-Work/Kaggle/jane-street-market-prediction/features.csv")
```
Once read into our R environment, the main _train_ dataset is 2.6GB, slightly larger than the size reported at the start of Jonathan's _reduce_mem_usage()_ function.
```{r read file size, echo=FALSE}
paste0("Size of train: ", round(object.size(train)/1000000, 0), " MB") # Get the in-memory size
paste0("Size of features: ", round(object.size(features)/1000, 0), " Kb")
```
There are 2.39m rows and 138 columns. One of the great things about this dataset is that it's almost ready to go. Getting your data into this shape in production for ad-hoc analyses takes time.
If we take a closer look at the data it looks to be entirely numeric. That simplifies things a bit. The "Data Description" on the competition page explains the dataset quite nicely:
> "This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represents a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.
> In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation."
So each observation is potentially a trade. If weight is 0, there was no trade. If weight is larger than 0, then the return of the trade would be _weight * resp_.
Trades with a weight of zero do not contribute to scoring, and the returns for horizons other than _resp_ are not included in the test set. The question that needs to be answered is whether information in the train set can increase the accuracy of our prediction model, even if that data will not be available in the test set.
> "Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation."
The evaluation criteria are below:
> "This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.
> For each date i, we define:
$$
p_i = \sum_j(weight_{ij} * resp_{ij} * action_{ij}),
$$
$$
t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}},
$$
> where $|i|$ is the number of unique dates in the test set. The utility is then defined as:
$$ u = min(max(t,0), 6) \sum p_i. $$
Based on the evaluation criteria, it's clear the volatility of the returns generated by the model will be an important criterion.
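To make the metric concrete, here is a minimal sketch of the utility calculation in R (not evaluated), assuming a data.table `df` with columns _date_, _weight_, _resp_ and a hypothetical 0/1 _action_ column holding our predictions:
```{r utility score sketch, eval=FALSE}
# Sketch: competition utility for a set of predicted actions.
# `df` is assumed to be a data.table with date, weight, resp and a 0/1 action column.
utility_score <- function(df) {
  daily <- df[, .(p = sum(weight * resp * action)), by = date]
  t <- sum(daily$p) / sqrt(sum(daily$p^2)) * sqrt(250 / nrow(daily))
  min(max(t, 0), 6) * sum(daily$p)
}
```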
Let's do a little more EDA...
# EDA
Jonathan gets 1,981,287 observations with a weight > 0, as do we. There are no weights < 0. That means roughly 17% of the observations (i.e. potential trades) have zero weight. We will have to think a bit more about whether or not the observations with zero weight should be disregarded.
If we look at the number of trades per day, the range is 29 to 18,884, with the mean and median roughly in line at around 4,500. That's a lot of trades, and should keep the brokers and exchanges happy.
```{r eda, echo=FALSE}
# JL gets 1981287 obs with weight != 0, as do we
length(which(train$weight != 0))
# This equates to 17% of the dataset
length(which(train$weight == 0))/nrow(train)
# inspect the head
head(train,2)
# confirm no obs with weight < 0
any(train$weight<0) # no weights < 0
# visualize number of trades by date
obs_by_date <- train[ , .(count = .N), by=date]
p <- ggplot(data = obs_by_date, aes(x = date, y = count))
p + geom_line(color = "deepskyblue4") + labs(title = "Trade observations by date",
x = "date", y = "# of observations") +
theme(plot.title = element_text(hjust = 0.5))
# plot(obs_by_date$count, type = "h") # base R
summary(obs_by_date$count)
```
At this point I don't see any value in keeping observations with a zero weight around, but before we throw them away let's look at a few visualizations of the data.
```{r eda plots, echo=FALSE}
# plot cumsum with all obs
# plot(cumsum(train$resp), type = "l")
p <- ggplot(train, aes(x = date, y = cumsum(resp)))
p + geom_line(color = "deepskyblue4") + labs(x = "date", y = "Cumsum resp") +
theme(plot.title = element_text(hjust = 0.5))
# plot cumsum with weights > 0
# plot(cumsum(train$resp[which(train$weight > 0)]), type = "l")
train_gt0 <- train[which(train$weight > 0),]
p <- ggplot(train_gt0, aes(x = date, y = cumsum(resp)))
p + geom_line(color = "deepskyblue4") + labs(x = "date", y = "Cumsum resp where weight > 0") +
theme(plot.title = element_text(hjust = 0.5))
```
It feels like there *could* be some decent _resp_ values with 0 weight. Let's look at the summary stats for _resp_ for weights > 0 and weights = 0.
```{r resp weights analysis, echo=FALSE}
# First we check no NA values in resp...CHECK
any(is.na(train[which(train$weight == 0),]$resp))
paste0("Mean resp for weights > 0: ", mean(train[which(train$weight > 0),]$resp))
paste0("Mean resp for weights = 0: ", mean(train[which(train$weight == 0),]$resp))
mean(train[which(train$weight > 0),]$resp)/mean(train[which(train$weight == 0),]$resp)
```
OK, I was wrong. Mean _resp_ (i.e. the return before weighting) is almost 5 times greater for trades with weight > 0 than for trades that were not taken. I think it is safe to disregard those observations before getting into feature engineering.
Jonathan also makes an observation, based on another notebook he references, that the mean daily return has a positive drift, but when weighted by the weight column the drift changes. This would imply the weighting scheme needs some work... although I would argue the scheme for determining whether or not to trade appears to work.
```{r eda plots 2, echo=FALSE}
train <- train_gt0
# rm(train_gt0)
train$weight_x_resp <- train$weight * train$resp
# Base R plot
# plot(cumprod(train[, 1 + mean(resp), by = date]$V1), type = "l", ylim = c(0.83, 1.21))
# lines(cumprod(train[, 1 + mean(weight_x_resp), by = date]$V1), type = "l", col = "red")
# ggplot2
cumprod_mean_resp <- cumprod(train[, 1 + mean(resp), by = date]$V1)
cumprod_mean_weight_x_resp <- cumprod(train[, 1 + mean(weight_x_resp), by = date]$V1)
# use the actual dates rather than a hard-coded sequence, so this also works on a trimmed dataset
plot_df <- cbind.data.frame(Date = unique(train$date), cumprod_mean_resp, cumprod_mean_weight_x_resp)
p <- ggplot(plot_df, aes(Date)) +
geom_line(aes(y = cumprod_mean_resp), color = "deepskyblue4") +
geom_line(aes(y = cumprod_mean_weight_x_resp), color = "firebrick")
p + labs(x = "date") + theme(axis.title.y = element_blank())
```
# Target Engineering
The main objective of Jonathan's notebook is to introduce the reader to a multi-target approach to solving this problem, using PyTorch Lightning. Jonathan talks about *Target Engineering*, and argues that for a low signal-to-noise problem like this one (a claim presumably rooted in trading domain knowledge, since no signal analysis precedes it) engineering the target will help improve the robustness of any proposed model. Apparently many in the competition to date were binarizing the target variable _(resp)_ based simply on whether its value was larger than 0. Jonathan asks whether he can do better than that.
Weight is important for this competition. It is a private dataset after all, based (presumably) on whether or not Jane Street made a trade and how much of a position they allocated to it. Jonathan mentions that simply multiplying _resp_ by _weight_ won't be of much value. I'm not sure why exactly. He mentions a better approach is to focus on the top N observations of both _resp_ and _weight_. I agree this would be an interesting question to answer, but ultimately the choice of threshold will matter. A rough sketch of such a filter follows, and then we reproduce a few of Jonathan's plots.
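The sketch below (not evaluated) flags observations in the top half of both _weight_ and _resp_ for an arbitrary threshold `q`; `top_both` is a hypothetical helper column, and choosing `q` is exactly the open question.
```{r top n filter sketch, eval=FALSE}
# Sketch: flag observations in the top (1 - q) share of both weight and resp.
# The threshold q is arbitrary here; tuning it is left open.
q <- 0.5
train[, top_both := as.numeric(weight >= quantile(weight, q) &
                               resp   >= quantile(resp, q))]
mean(train$top_both) # share of observations passing both filters
```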
```{r scatter resp to weight, echo = FALSE}
# plot(train$weight, train$resp) # base R equivalent (zero weights already excluded above)
p <- ggplot(train, aes(x = weight, y = resp))
p + geom_point(alpha = 0.25, color = "deepskyblue4") + labs(x = "weight", y = "resp")
```
I am not immediately sure what the scatter plot tells us, other than that the higher _weight_ trades occur where _resp_ is closer to zero. I'm still not sure whether the dataset is for a single security or for multiple securities. I suspect it is multiple, as the _resp_ distribution is quite wide for a single security. Presumably Jane Street's weighting scheme takes volatility estimates into account, evidenced by higher weightings closer to the mean and potentially hinted at by the evaluation criteria, which appear to include a term for annualizing volatility.
Let's take a closer look at the distribution of the _resp_ variable.
```{r hist, echo=FALSE}
# hist(train$resp, breaks = 100)
p <- ggplot(train, aes(x = resp))
p + geom_histogram(color = "deepskyblue4", fill= "deepskyblue4", bins = 100)
```
Similar to our scatter plot. Not surprising.
Jonathan goes on to view a histogram of log-transformed weights. We look at weights without the log transform too. The weights are log-normally distributed. This is probably important, but it's not yet clear whether or not Jonathan uses this information.
```{r hist2, echo=FALSE}
# hist(train$weight, breaks = 50) # not throwing any away yet
# hist(log(train$weight), breaks = 50)
p <- ggplot(train, aes(x = weight))
p + geom_histogram(color = "deepskyblue4", fill= "deepskyblue4", bins = 50)
p <- ggplot(train, aes(x = log(weight)))
p + geom_histogram(color = "deepskyblue4", fill= "deepskyblue4", bins = 50)
```
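As a quick, informal check on the log-normal claim, the normal Q-Q plot of the log weights should be roughly a straight line (a sketch, not evaluated):
```{r log weight qq sketch, eval=FALSE}
# Informal check: if weight is log-normal, log(weight) should look normal,
# so its normal Q-Q plot should be close to a straight line.
qqnorm(log(train$weight))
qqline(log(train$weight), col = "firebrick")
```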
Jonathan goes on to bin the _weight*resp_ values into 100 bins. We use the _qcut_ function from timereg for this.
```{r cut and bin, echo=FALSE}
require(timereg) # after a long time searching, could not find a convenient function other than qcut from timereg
bins = 100
train <- train[which(train$weight > 0),]
train[ , daily_wr_bin := qcut(weight_x_resp, cuts = bins, labels = FALSE), by=date]
head(train[,138:140], 10) # Agrees with JL result, off by 1 for zero-index :)
```
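For reference, the same quantile bins can be built in base R without the timereg dependency, using cut() with quantile() breaks; `qcut_base` is a hypothetical helper (a sketch, not evaluated, which assumes the within-day breakpoints are unique and may differ from qcut() in how ties are handled):
```{r base r qcut sketch, eval=FALSE}
# Base R alternative to timereg::qcut(): bin by within-day quantiles of weight_x_resp.
qcut_base <- function(x, cuts) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = cuts + 1)),
      labels = FALSE, include.lowest = TRUE)
}
train[, daily_wr_bin_base := qcut_base(weight_x_resp, cuts = bins), by = date]
```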
Most trade observations are in the tails of this distribution...
```{r plot daily bins, echo=FALSE}
# plot(train$daily_wr_bin, train$weight_x_resp)
p <- ggplot(train, aes(x = daily_wr_bin, y = weight_x_resp))
p + geom_point(alpha = 0.25, color = "deepskyblue4") + labs(x = "daily_wr_bin", y = "weight_x_resp") + geom_vline(xintercept=bins/2, linetype = 2, color = "red")
```
Next, we add a bin_resp col.
```{r bin_resp, echo=FALSE}
train$bin_resp <- as.numeric(train$daily_wr_bin > bins/2)
head(train$bin_resp)
```
At this point Jonathan does some logistic regression using cross validation, and concludes that there is a greater possibility of predicting _weight_x_resp_ when the observations are in the top 50%. Let's dive into some new Python code and see if we can find an R equivalent.
First, Jonathan uses scikit-learn's Pipeline, which chains the imputation method (SimpleImputer with strategy='mean'), the scaler (StandardScaler) and a LogisticRegression model (with parameters C and max_iter=1000). This pipeline of steps is later passed to the custom function _run_cv_, which runs the logistic regression. Jonathan uses a Purged Group Time Series Split cross-validation method for the prediction. He basically splits the train data 3 times, into 4 folds, ensuring the test data is always after the train data, and minimizing leakage by adding a gap of 20 days between the train and test sets. We, however, will run the logistic regression only once, using _resp > 0_ and _bin_resp > 0_ on the entire dataset. We won't bother with cross validation for now... although I am unaware of a Purged Group Time Series Split cross-validation function in R, and replicating the function Jonathan uses may be a useful addition to the R ecosystem (assuming Rob Hyndman has not done so already with the *forecast* package).
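For what it's worth, here is a rough sketch (not evaluated) of what a purged, date-grouped time-series split could look like in R. `purged_time_series_folds()` is a hypothetical helper, not a replication of the exact scheme Jonathan uses: each fold trains on earlier dates, tests on later dates, and drops `gap` days in between to limit leakage.
```{r purged split sketch, eval=FALSE}
# Sketch of a purged, date-grouped time-series split.
purged_time_series_folds <- function(dates, n_splits = 3, gap = 20) {
  u <- sort(unique(dates))
  fold_size <- floor(length(u) / (n_splits + 1))
  lapply(seq_len(n_splits), function(k) {
    train_dates <- u[1:(k * fold_size)]
    test_dates  <- u[(k * fold_size + gap + 1):min((k + 1) * fold_size, length(u))]
    list(train = which(dates %in% train_dates),
         test  = which(dates %in% test_dates))
  })
}
# folds <- purged_time_series_folds(train$date)
```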
```{r simple impute, echo=FALSE}
feature_cols <- which(grepl("feature", colnames(train)))
x <- train[, feature_cols, with = FALSE]
paste0("Any NA before imputing: ", any(is.na(x)))
# Count how many feature columns contain NA values
paste0("Feature columns containing NA: ", sum(sapply(x, function(col) any(is.na(col)))))
# Most columns have NAs. Could not find a convenient imputation helper, so fill
# each column's NAs with that column's mean (indexing into x, not train, so the
# correct column's mean is used)
for(j in seq_along(feature_cols)){
set(x, i = which(is.na(x[[j]])), j = j, value = mean(x[[j]], na.rm=TRUE))
}
# Let's test
paste0("Any NA after imputing: ", any(is.na(x))) # confirm no NAs
```
Now we scale, using standard scaling...
```{r scale features, echo=FALSE}
x <- scale(x)
head(x[,1:10])
mean(x[,1]) # should tend to zero
sd(x[,1]) # should tend to 1
# consider x scaled
```
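One caveat: `scale()` above computes means and standard deviations over all rows, including the rows we later treat as a test set. A sketch (not evaluated) of the alternative, using training-row statistics only, assuming `train_end` is the train/test split row defined in the logistic regression chunk below:
```{r train only scaling sketch, eval=FALSE}
# Sketch: compute centering/scaling parameters on the training rows only,
# then apply them to every row, to avoid leaking test-set information.
ctr <- colMeans(x[1:train_end, ])
scl <- apply(x[1:train_end, ], 2, sd)
x   <- sweep(sweep(x, 2, ctr, "-"), 2, scl, "/")
```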
*WARNING:* By now you are likely using heaps of RAM. For that reason it may be worthwhile to trim the train dataset by 75%, for example. You will still have >500k observations. If you have 64GB of RAM you probably don't have to worry. If you have 32GB like me... and can't use Kaggle's resources for this... you should probably trim the train dataset and restart R. I'm sure there is an R alternative similar to the _reduce_mem_usage()_ function Jonathan uses... but I have not figured it out yet.
OK, let's do some logistic regression...
```{r trim train, echo=FALSE}
# Trim the dataset by 75% for dev purposes
train <- train[1:(floor(nrow(train)*0.25))]
rm(train_gt0)
```
```{r logistic regression, echo=FALSE}
train_end <- floor(0.8 * nrow(train))
test_start <- train_end + 1
test_end <- nrow(train)
x_train <- x[1:train_end,]
y_train <- as.numeric(train[["resp"]]>0)[1:train_end]
x_test <- x[test_start:test_end,]
y_test <- as.numeric(train[["resp"]]>0)[test_start:test_end]
head(y_train)
head(y_test)
train_data <- cbind.data.frame(y_train, x_train)
test_data <- cbind.data.frame(y_test, x_test)
head(train_data[1:10,1:10])
rm(x)
# Logistic Regression
glm.fit <- glm(y_train ~ ., data = train_data, family = binomial)
summary(glm.fit)
glm.probs <- predict(glm.fit, newdata = test_data, type=c("response"))
if (!require(pROC)) install.packages("pROC", verbose=TRUE)
auc <- roc(y_test ~ glm.probs, data = cbind.data.frame(y_test, x_test))
auc # 52.54% on the whole dataset, 51.14% on the trimmed dataset
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
mean(glm.pred == y_test) # 51.67% classification accuracy on the whole dataset, 51.25% on the trimmed dataset
```
OK, we get just above 51% accuracy using logistic regression with _resp_, in line with Jonathan's approach using cross validation and a Python model.
Now we try the regression on _bin_resp_...
```{r logistic regression bin, echo=FALSE}
y_train <- as.numeric(train[["bin_resp"]])[1:train_end]
y_test <- as.numeric(train[["bin_resp"]])[test_start:test_end]
head(y_train)
head(y_test)
train_data <- cbind.data.frame(y_train, x_train)
test_data <- cbind.data.frame(y_test, x_test)
head(train_data[1:10,1:10])
# Logistic Regression
glm.fit <- glm(y_train ~ ., data = train_data, family = binomial)
summary(glm.fit)
glm.probs <- predict(glm.fit, newdata = test_data, type=c("response"))
if (!require(pROC)) install.packages("pROC", verbose=TRUE)
auc <- roc(y_test ~ glm.probs, data = cbind.data.frame(y_test, x_test))
auc # 52.56% on the entire dataset, 51.17% on the trimmed dataset
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
mean(glm.pred == y_test) # 51.8% classification accuracy on the entire dataset, 50.7% on the trimmed dataset
```
OK, so when running logistic regression on the entire dataset, we see an ever so slight improvement when using the target variable _bin_resp_, with an AUC similar to Jonathan's. However, when we look at the difference between the two models on the trimmed data, we see the model deteriorate when using bin_resp. Interesting. Either way, the differences are so marginal it almost feels like the choice between target variables is insignificant. Of course, this might not be the case with engineered features. But that is a TODO.
Jonathan concludes that predicting _bin_resp_ (i.e. resp * weight in the top 50%) gives greater accuracy than predicting _resp_. The difference is marginal in his results, and in ours too, although in the same direction (on the entire dataset).
Now we plot _resp_ against each of _resp_1_ to _resp_4_ to show why Jonathan thinks _resp_ has a longer time horizon than _resp_1_ to _resp_3_ but a shorter one than _resp_4_. This is quite genius, but I'm not sure it adds to our model. More on that later...
```{r plot horizons resp resp_1, echo=FALSE}
# plot(train$resp, type = "l", col = "black")
# lines(train$resp_1, type = "l", col = "red")
# see https://community.rstudio.com/t/adding-manual-legend-to-ggplot2/41651/3 for manually adding legend in ggplot
colors <- c("resp" = "deepskyblue4", "resp_1" = "firebrick")
p <- ggplot(train, aes(date)) +
geom_line(aes(y = resp, color = "resp")) +
geom_line(aes(y = resp_1, color = "resp_1"))
p + labs(x = "date", color = "Legend") + scale_color_manual(values = colors) + theme(axis.title.y = element_blank())
```
So as we can see, _resp_ is more volatile than _resp_1_.
```{r plot horizons resp resp_2, echo=FALSE}
# plot(train$resp, type = "l", col = "black")
# lines(train$resp_2, type = "l", col = "red")
colors <- c("resp" = "deepskyblue4", "resp_2" = "firebrick")
p <- ggplot(train, aes(date)) +
geom_line(aes(y = resp, color = "resp")) +
geom_line(aes(y = resp_2, color = "resp_2"))
p + labs(x = "date", color = "Legend") + scale_color_manual(values = colors) + theme(axis.title.y = element_blank())
```
Also, _resp_ is more volatile than _resp_2_.
```{r plot horizons resp resp_3, echo=FALSE}
# plot(train$resp, type = "l", col = "black")
# lines(train$resp_3, type = "l", col = "red")
colors <- c("resp" = "deepskyblue4", "resp_3" = "firebrick")
p <- ggplot(train, aes(date)) +
geom_line(aes(y = resp, color = "resp")) +
geom_line(aes(y = resp_3, color = "resp_3"))
p + labs(x = "date", color = "Legend") + scale_color_manual(values = colors) + theme(axis.title.y = element_blank())
```
Again, _resp_ is more volatile than _resp_3_.
As Jonathan mentions, a stylized fact of financial return series is that volatility scales with the horizon (roughly with its square root). On this basis, it would be fair to assume the horizons for resp_1, resp_2 and resp_3 are shorter than resp's. The next plot shows resp_4 being more volatile than resp, which would imply resp has a shorter horizon than resp_4.
```{r plot horizons resp resp_4, echo=FALSE}
# plot(train$resp, type = "l", col = "black")
# lines(train$resp_4, type = "l", col = "red")
colors <- c("resp" = "deepskyblue4", "resp_4" = "firebrick")
p <- ggplot(train, aes(date)) +
geom_line(aes(y = resp, color = "resp")) +
geom_line(aes(y = resp_4, color = "resp_4"))
p + labs(x = "date", color = "Legend") + scale_color_manual(values = colors) + theme(axis.title.y = element_blank())
```
This is relevant because in addition to the end result (the observed return for a given horizon) there may be some value in looking at the path to the destination, which the shorter horizons give us insight into. It is, as Jonathan mentions, a corollary to Lopez de Prado's Triple Barrier Method. There is some R code for this in the [ML for Factor Investing book](http://www.mlfactor.com/Data.html#the-triple-barrier-method), and we have an [Issue open for this in quantstrat](https://github.com/braverock/quantstrat/issues/105).
Of course, we can look at the standard deviations of these returns directly to observe their "volatility". Clearly _resp_4_ has a greater standard deviation than all the other horizons, and _resp_ greater than the remaining three. We will see later if it adds value.
```{r plot standard deviations, echo=FALSE}
# barplot(c(sd(train$resp_1), sd(train$resp_2), sd(train$resp_3), sd(train$resp_4), sd(train$resp)))
plot_df <- cbind.data.frame(resp_1 = sd(train$resp_1), resp_2 = sd(train$resp_2), resp_3 = sd(train$resp_3), resp = sd(train$resp), resp_4 = sd(train$resp_4))
plot_df <- as.data.frame(sapply(plot_df, format, digits = 2))
# colnames(plot_df) <- c("resp_1", "resp_2", "resp_3", "resp", "resp_4")
plot_df <- as.data.frame(cbind(plot_df, rbind("not_resp", "not_resp", "not_resp", "resp", "not_resp")))
colnames(plot_df) <- c("StandardDeviation", "Fill")
ggplot(plot_df, aes(x = rownames(plot_df), y = StandardDeviation, fill = Fill)) + geom_bar(stat = "identity") + labs(title = "Standard Deviation of each `resp_`") + theme(axis.title.y = element_blank(), axis.title.x = element_blank(), legend.position = "none", plot.title = element_text(hjust = 0.5)) + scale_fill_manual("legend", values = c("resp" = "deepskyblue4", "not_resp" = "firebrick"))
```
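If we take the square-root-of-time scaling at face value, the squared ratio of standard deviations gives a rough implied horizon for each return column relative to _resp_ (a sketch, not evaluated, and only as reliable as that assumption):
```{r implied horizon sketch, eval=FALSE}
# Rough implied horizon of each return column relative to resp, assuming
# volatility scales with the square root of the horizon.
sds <- sapply(train[, .(resp_1, resp_2, resp_3, resp, resp_4)], sd)
round((sds / sds["resp"])^2, 2)
```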
# Time for _torch_
Now we try to replicate Jonathan's PyTorch code in R. To do this, I relied heavily on both Jonathan's code and the tutorials on the ["torch for R"](https://torch.mlverse.org/) website.
First, create the dataset using torch's `dataset()` constructor, which generates an R6 class.
```{r dataset preprocessing, echo=FALSE}
require(torch)
BATCH_SIZE = 4096
janestreet_data <- train
janestreet_dataset <- dataset(
name = "janestreet_dataset",
initialize = function(indices) {
data <- self$prepare_janestreet_data(janestreet_data[indices, ])
self$x <- data[[1]]
self$y <- data[[2]]
},
.getitem = function(i) {
x <- self$x[i, ]
y <- self$y[i, ]
list(x = x, y = y)
},
.length = function() {
dim(self$y)[1]
},
prepare_janestreet_data = function(input) {
# binary target: resp > 0 is already coded 0/1, which is what nnf_binary_cross_entropy() expects
# (the mushroom tutorial this was adapted from re-coded its factor target by subtracting 1; resp > 0 needs no re-coding)
target_col <- as.matrix(as.numeric(input$resp > 0))
feature_cols <- which(grepl("feature", colnames(input)))
x <- input[, feature_cols, with = FALSE]
# impute NAs with each feature column's own mean
for(j in seq_along(feature_cols)){
set(x, i = which(is.na(x[[j]])), j = j, value = mean(x[[j]], na.rm=TRUE))}
x <- as.matrix(x)
list(torch_tensor(x), torch_tensor(target_col))
}
)
```
Data Loader time...
```{r torch dataloader, echo=FALSE}
train_indices <- sample(1:nrow(janestreet_data), size = floor(0.8 * nrow(janestreet_data)))
valid_indices <- setdiff(1:nrow(janestreet_data), train_indices)
# debugonce(janestreet_dataset)
train_ds <- janestreet_dataset(train_indices)
train_dl <- train_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE) # shuffle = FALSE - we don't want to shuffle time series data (though note the train/valid split above is random)
valid_ds <- janestreet_dataset(valid_indices)
valid_dl <- valid_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
```
Model configuration time...
```{r model in R, echo=FALSE}
TARGET_NAMES = c('resp_target', 'resp_2_target') # placeholders for a future multi-target setup; the model below has a single output
FEATURE_NAMES = colnames(train[, feature_cols, with = FALSE])
NUM_FEATURES = length(FEATURE_NAMES)
NUM_TARGETS = length(TARGET_NAMES)
# dropouts <- list(0.10, 0.20, 0.25, 0.25, 0.25)
# hidden_size = list(384, 896, 896, 384)
model <- nn_sequential(
nn_batch_norm1d(NUM_FEATURES),
nn_dropout(0.10),
nn_linear(NUM_FEATURES, 384),
nn_batch_norm1d(384),
# nn_hardswish(),
nn_dropout(0.20),
nn_linear(384, 896),
nn_batch_norm1d(896),
# nn_hardswish(),
nn_dropout(0.25),
nn_linear(896, 896),
nn_batch_norm1d(896),
# nn_hardswish(),
nn_dropout(0.25),
nn_linear(896, 384),
nn_batch_norm1d(384),
# nn_hardswish(),
nn_dropout(0.25),
nn_linear(384, 1),
nn_sigmoid()
)
```
```{r model parameters, echo=FALSE}
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
learning_rate <- 0.0001
optimizer <- optim_adam(model$parameters, lr = learning_rate)
```
Training...
*WARNING:* Have your social media platform of choice open to peruse while your model is training, unless of course you have CUDA installed and GPU options available (I went down this path, which caused multiple other headaches, and I suspect my graphics card is a bit outdated).
```{r training, echo=FALSE}
EPOCHS = 5
for (epoch in 1:EPOCHS) {
model$train()
train_losses <- c()
for (b in enumerate(train_dl)) {
# make sure each batch's gradient updates are calculated from a fresh start
optimizer$zero_grad()
# get model predictions
output <- model(b[[1]]$to(device = device))
# calculate loss
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
# calculate gradient
loss$backward()
# apply weight updates
optimizer$step()
# track losses
train_losses <- c(train_losses, loss$item())
}
model$eval()
valid_losses <- c()
for (b in enumerate(valid_dl)) {
output <- model(b[[1]]$to(device = device))
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
valid_losses <- c(valid_losses, loss$item())
}
cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
```
Evaluation
```{r Evaluation, echo=FALSE}
model$eval()
test_dl <- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
iter <- test_dl$.iter()
b <- iter$.next()
# test_dl <- train_ds %>% dataloader(batch_size = train_ds$.length(), shuffle = FALSE)
# iter <- test_dl$.iter()
# b <- iter$.next()
output <- model(b[[1]]$to(device = device))
preds <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)
comp_df <- data.frame(preds = preds, y = b[[2]] %>% as_array())
num_correct <- sum(comp_df$preds == comp_df$y)
num_total <- nrow(comp_df)
accuracy <- num_correct/num_total
accuracy
```
So we get 51.4% accuracy on the validation set, and 51.2% on the train set. This is using only resp > 0 as the target variable. Let's try _weight_x_resp > 0_ next and then _bin_resp_ after that.
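As a side note before trying the new targets, the same pROC check used for the glm models can be applied to the raw torch probabilities for a more direct comparison (a sketch, not evaluated; it reuses `output` and `b` from the evaluation chunk above):
```{r torch auc sketch, eval=FALSE}
# AUC of the raw validation-batch probabilities, comparable to the glm AUC above.
probs <- as.numeric(output$to(device = "cpu") %>% as_array())
truth <- as.numeric(b[[2]] %>% as_array())
roc(truth, probs)
```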
```{r dataset preprocessing training evaluation - round 2, echo=FALSE}
# Preprocessing
janestreet_dataset <- dataset(
name = "janestreet_dataset",
initialize = function(indices) {
data <- self$prepare_janestreet_data(janestreet_data[indices, ])
self$x <- data[[1]]
self$y <- data[[2]]
},
.getitem = function(i) {
x <- self$x[i, ]
y <- self$y[i, ]
list(x = x, y = y)
},
.length = function() {
dim(self$y)[1]
},
prepare_janestreet_data = function(input) {
# binary target: weight_x_resp > 0 is already coded 0/1
target_col <- as.matrix(as.numeric(input$weight_x_resp > 0))
feature_cols <- which(grepl("feature", colnames(input)))
x <- input[, feature_cols, with = FALSE]
# impute NAs with each feature column's own mean
for(j in seq_along(feature_cols)){
set(x, i = which(is.na(x[[j]])), j = j, value = mean(x[[j]], na.rm=TRUE))}
x <- as.matrix(x)
list(torch_tensor(x), torch_tensor(target_col))
}
)
# Rebuild the datasets and dataloaders so the new target definition is picked up
# (note: the model and optimizer are reused, so training continues from the previous round's weights)
train_ds <- janestreet_dataset(train_indices)
train_dl <- train_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
valid_ds <- janestreet_dataset(valid_indices)
valid_dl <- valid_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
# Training
EPOCHS = 5
for (epoch in 1:EPOCHS) {
model$train()
train_losses <- c()
for (b in enumerate(train_dl)) {
# make sure each batch's gradient updates are calculated from a fresh start
optimizer$zero_grad()
# get model predictions
output <- model(b[[1]]$to(device = device))
# calculate loss
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
# calculate gradient
loss$backward()
# apply weight updates
optimizer$step()
# track losses
train_losses <- c(train_losses, loss$item())
}
model$eval()
valid_losses <- c()
for (b in enumerate(valid_dl)) {
output <- model(b[[1]]$to(device = device))
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
valid_losses <- c(valid_losses, loss$item())
}
cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
# Evaluation
model$eval()
test_dl <- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
iter <- test_dl$.iter()
b <- iter$.next()
# test_dl <- train_ds %>% dataloader(batch_size = train_ds$.length(), shuffle = FALSE)
# iter <- test_dl$.iter()
# b <- iter$.next()
output <- model(b[[1]]$to(device = device))
preds <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)
comp_df <- data.frame(preds = preds, y = b[[2]] %>% as_array())
num_correct <- sum(comp_df$preds == comp_df$y)
num_total <- nrow(comp_df)
accuracy <- num_correct/num_total
accuracy
```
Around 51.36% to 51.46% accuracy, depending on whether we use the entire or the trimmed dataset. Not great. Another TODO would be to see what accuracy we get with Jonathan's code using the same target variables. Of course, we are not using multi-target engineering yet.
```{r dataset preprocessing training evaluation - round 3, echo=FALSE}
# Preprocessing
janestreet_dataset <- dataset(
name = "janestreet_dataset",
initialize = function(indices) {
data <- self$prepare_janestreet_data(janestreet_data[indices, ])
self$x <- data[[1]]
self$y <- data[[2]]
},
.getitem = function(i) {
x <- self$x[i, ]
y <- self$y[i, ]
list(x = x, y = y)
},
.length = function() {
dim(self$y)[1]
},
prepare_janestreet_data = function(input) {
# binary target: bin_resp is already coded 0/1
target_col <- as.matrix(as.numeric(input$bin_resp))
feature_cols <- which(grepl("feature", colnames(input)))
x <- input[, feature_cols, with = FALSE]
# impute NAs with each feature column's own mean
for(j in seq_along(feature_cols)){
set(x, i = which(is.na(x[[j]])), j = j, value = mean(x[[j]], na.rm=TRUE))}
x <- as.matrix(x)
list(torch_tensor(x), torch_tensor(target_col))
}
)
# Rebuild the datasets and dataloaders so the new target definition is picked up
# (note: the model and optimizer are reused, so training continues from the previous round's weights)
train_ds <- janestreet_dataset(train_indices)
train_dl <- train_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
valid_ds <- janestreet_dataset(valid_indices)
valid_dl <- valid_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
# Training
EPOCHS = 5
for (epoch in 1:EPOCHS) {
model$train()
train_losses <- c()
for (b in enumerate(train_dl)) {
# make sure each batch's gradient updates are calculated from a fresh start
optimizer$zero_grad()
# get model predictions
output <- model(b[[1]]$to(device = device))
# calculate loss
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
# calculate gradient
loss$backward()
# apply weight updates
optimizer$step()
# track losses
train_losses <- c(train_losses, loss$item())
}
model$eval()
valid_losses <- c()
for (b in enumerate(valid_dl)) {
output <- model(b[[1]]$to(device = device))
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
valid_losses <- c(valid_losses, loss$item())
}
cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
# Evaluation
model$eval()
test_dl <- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
iter <- test_dl$.iter()
b <- iter$.next()
# test_dl <- train_ds %>% dataloader(batch_size = train_ds$.length(), shuffle = FALSE)
# iter <- test_dl$.iter()
# b <- iter$.next()
output <- model(b[[1]]$to(device = device))
preds <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)
comp_df <- data.frame(preds = preds, y = b[[2]] %>% as_array())
num_correct <- sum(comp_df$preds == comp_df$y)
num_total <- nrow(comp_df)
accuracy <- num_correct/num_total
accuracy
```
Accuracy 51.37% to 51.47%...again, no better.
We will need to do some feature engineering to see if we can improve the model.
Some more EDA... in particular, let's look more closely at the return horizons of each _resp_.
```{r EDA round 2, echo=FALSE}
cols <- c("resp_1", "resp_2", "resp_3", "resp")
# mat <- as.matrix(train[ , cols, with = FALSE])
resp <- as.numeric(train$resp>0)
resp_1 <- as.numeric(train$resp_1>0)
resp_2 <- as.numeric(train$resp_2>0)
resp_3 <- as.numeric(train$resp_3>0)
resp_4 <- as.numeric(train$resp_4>0)
mat <- cbind(resp_1, resp_2, resp_3, resp)
heatmap(head(mat), scale = "none", Rowv = NA, Colv = NA, col = cm.colors(2), main = "HeatMap Example")
```
Too many observations to view with a heatmap, so we need a better way to represent things. Let's look at how many observations for each of resp > 0 and resp < 0 had identical signs in the horizons leading up to _resp_.
```{r resp analysis, echo=FALSE}
paste0("Number of obs with all horizons > 0 (excl resp_4): ", length(which(mat[,1] * mat[,2] * mat[,3] * mat[,4] == 1)))
paste0("Total observations: ", nrow(mat))
```
OK, we have a decent number of observations I think, even on the trimmed dataset, where resp_1, resp_2, resp_3 and resp are all > 0. Let's make that our target (TODO: understand and quantify how this is different to Jonathan's multi-target approach).
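Before wiring this target into torch, a quick sanity check (a sketch, not evaluated) on how often the four horizons fully agree in sign, using the binarized `mat` built above:
```{r sign agreement sketch, eval=FALSE}
# mat holds 0/1 indicators for resp_1, resp_2, resp_3 and resp being > 0.
all_positive <- rowSums(mat) == 4  # every horizon positive
all_negative <- rowSums(mat) == 0  # every horizon non-positive
paste0("Share of rows with full sign agreement: ",
       round(mean(all_positive | all_negative), 3))
```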
```{r dataset preprocessing resp1 2 3 and resp, echo=FALSE}
janestreet_data <- train
janestreet_dataset <- dataset(
name = "janestreet_dataset",
initialize = function(indices) {
data <- self$prepare_janestreet_data(janestreet_data[indices, ])
self$x <- data[[1]]
self$y <- data[[2]]
},
.getitem = function(i) {
x <- self$x[i, ]
y <- self$y[i, ]
list(x = x, y = y)
},
.length = function() {
dim(self$y)[1]
},
prepare_janestreet_data = function(input) {
# binary target: 1 when resp_1, resp_2, resp_3 and resp are all > 0, else 0
target_col <- as.matrix(as.numeric(input$resp_1 > 0 &
input$resp_2 > 0 &
input$resp_3 > 0 &
input$resp > 0))
feature_cols <- which(grepl("feature", colnames(input)))
x <- input[, feature_cols, with = FALSE]
# impute NAs with each feature column's own mean
for(j in seq_along(feature_cols)){
set(x, i = which(is.na(x[[j]])), j = j, value = mean(x[[j]], na.rm=TRUE))}
x <- as.matrix(x)
list(torch_tensor(x), torch_tensor(target_col))
}
)
```
```{r training target engineering, echo=FALSE}
# Rebuild the datasets and dataloaders so the new target definition is picked up
# (note: the model and optimizer are reused, so training continues from the previous round's weights)
train_ds <- janestreet_dataset(train_indices)
train_dl <- train_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
valid_ds <- janestreet_dataset(valid_indices)
valid_dl <- valid_ds %>% dataloader(batch_size = BATCH_SIZE, shuffle = FALSE)
EPOCHS = 5
for (epoch in 1:EPOCHS) {
model$train()
train_losses <- c()
for (b in enumerate(train_dl)) {
# make sure each batch's gradient updates are calculated from a fresh start
optimizer$zero_grad()
# get model predictions
output <- model(b[[1]]$to(device = device))
# calculate loss
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
# calculate gradient
loss$backward()
# apply weight updates
optimizer$step()
# track losses
train_losses <- c(train_losses, loss$item())
}
model$eval()
valid_losses <- c()
for (b in enumerate(valid_dl)) {
output <- model(b[[1]]$to(device = device))
loss <- nnf_binary_cross_entropy(output, b[[2]]$to(device = device))
valid_losses <- c(valid_losses, loss$item())
}
cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
```
Evaluation
```{r Evaluation target engineering, echo=FALSE}
model$eval()
test_dl <- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
iter <- test_dl$.iter()
b <- iter$.next()
# test_dl <- train_ds %>% dataloader(batch_size = train_ds$.length(), shuffle = FALSE)
# iter <- test_dl$.iter()
# b <- iter$.next()
output <- model(b[[1]]$to(device = device))
preds <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)
comp_df <- data.frame(preds = preds, y = b[[2]] %>% as_array())
num_correct <- sum(comp_df$preds == comp_df$y)
num_total <- nrow(comp_df)
accuracy <- num_correct/num_total
accuracy
```
51.5% accuracy... not great. I was hoping for better.
# Conclusion
In this document we reproduced, in R, Jonathan Larkin's kindly shared Kaggle Python notebook for the Jane Street Market Prediction competition. There is still much more I hope to do before the competition comes to a close, including:
1. Feature engineering. There are 130 features as well as feature metadata, but their importance will not be uniform. Options include PCA and other feature importance techniques (a minimal sketch follows this list). It is potentially also worth looking at some auto EDA packages for this purpose - see https://arxiv.org/pdf/1904.02101.pdf for a comprehensive summary of useful auto EDA packages in R.
2. Assess the accuracy with Jonathan's multi-target model, and how that compares to our single target technique.
3. Test multiple models and make a submission, in Python, using the best one.
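On the first item, a minimal sketch (not evaluated) of what a PCA pass over the features could look like, assuming the imputed and scaled feature matrix `x` from the earlier chunks is recreated (it was removed with `rm(x)` above to save memory):
```{r pca sketch, eval=FALSE}
# Minimal PCA sketch over the imputed, scaled feature matrix x.
pca <- prcomp(x, center = FALSE, scale. = FALSE) # x is already centred and scaled
summary(pca)$importance[, 1:10]                  # variance explained by the first 10 PCs
```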
Thanks to Jonathan for the original share, and of course the references in his notebook, and last but not least the team at RStudio who made torch available in R. Without all of them the above would not be possible. I certainly look forward to comparing Keras/TensorFlow with torch going forward.
I hope to put this info into a Kaggle R Notebook soon.
Thanks for reading.