Skip to content

Instantly share code, notes, and snippets.

@tarensanders
Last active June 13, 2024 01:10
Show Gist options
  • Save tarensanders/899972d48fa86be82a03a33aae58bb4d to your computer and use it in GitHub Desktop.
Save tarensanders/899972d48fa86be82a03a33aae58bb4d to your computer and use it in GitHub Desktop.

Example regression of classes

Create Example Data

I’ll make some dummy data to show how to do this. Ignore how ludicruous this data is. I’m going to make it so that there are 4 classes, otherwise we might get convergence issues when we try to pull these classes. In this data I’m going to use BMI as the outcome variable, and we’ll also try inluding gender in the regression.

library(dplyr)
library(poLCA)
library(broom)
library(ggplot2)
set.seed(123)

# Data params
n <- 1000
latent_classes <- 4

# Create a data frame with class assignments
dummy_data <- data.frame(
  id = 1:n,
  latent_class = sample(1:latent_classes, n, replace = TRUE)
)

# Define probabilities for sports participation for each class at each time point
prob_matrix <- matrix(c(
  0.1, 0.1, 0.1, 0.1, 0.6,  # Class 1 - late starters
  0.8, 0.8, 0.8, 0.8, 0.8,  # Class 2 - continuous
  0.1, 0.1, 0.1, 0.1, 0.1,  # Class 3 - low participation
  0.1, 0.2, 0.6, 0.7, 0.8   # Class 4 - increasing
), nrow = latent_classes, byrow = TRUE)

# Generate sports participation data based on class probabilities
for (time_point in 1:5) {
  col_name <- paste0("sports_time", time_point)
  dummy_data[[col_name]] <- apply(dummy_data, 1, function(row) {
    class <- as.numeric(row["latent_class"])
    prob <- prob_matrix[class, min(time_point, ncol(prob_matrix))]
    rbinom(1, 1, prob) + 1
  })
}

# Add gender and BMI variables
dummy_data <- dummy_data %>%
  mutate(
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    BMI = round(rnorm(n, mean = 25, sd = 4), 1)
  )

# Remove the class variable since that's what we're trying to predict
dummy_data <- dummy_data %>%
  dplyr::select(-latent_class)

head(dummy_data)
##   id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1  1            1            1            1            1            1   Male
## 2  2            1            1            2            1            2 Female
## 3  3            1            1            1            1            1 Female
## 4  4            1            2            2            1            2 Female
## 5  5            1            1            1            1            1 Female
## 6  6            2            2            2            2            2 Female
##    BMI
## 1 30.9
## 2 19.4
## 3 17.5
## 4 23.9
## 5 26.7
## 6 24.5

Create the classes

I don’t recall what package you were using to get classes, but they all work more or less the same.

# Specify the formula for the LCA model
formula <- cbind(sports_time1, sports_time2, sports_time3, sports_time4, sports_time5) ~ 1

# Fit LCA models with different numbers of classes
lca_model_2 <- poLCA(formula, data = dummy_data, nclass = 2, maxiter = 1000)
lca_model_3 <- poLCA(formula, data = dummy_data, nclass = 3, maxiter = 1000)
lca_model_4 <- poLCA(formula, data = dummy_data, nclass = 4, maxiter = 1000)
lca_model_5 <- poLCA(formula, data = dummy_data, nclass = 5, maxiter = 1000)
bic_values <- c(lca_model_2$bic, lca_model_3$bic, lca_model_4$bic, lca_model_5$bic)
names(bic_values) <- c("2 Classes", "3 Classes", "4 Classes", "5 Classes")
bic_values
## 2 Classes 3 Classes 4 Classes 5 Classes 
##  5921.356  5873.820  5913.720  5954.247

Somehow I still ended up with 3 classes being best 🙅. Whatever, we’ll do three then.

So we have our model that ‘predicts’ what class each person is in. We need to add that class back into our dataframe.

dummy_data$class <- lca_model_3$predclass
head(dummy_data)
##   id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1  1            1            1            1            1            1   Male
## 2  2            1            1            2            1            2 Female
## 3  3            1            1            1            1            1 Female
## 4  4            1            2            2            1            2 Female
## 5  5            1            1            1            1            1 Female
## 6  6            2            2            2            2            2 Female
##    BMI class
## 1 30.9     1
## 2 19.4     1
## 3 17.5     1
## 4 23.9     1
## 5 26.7     1
## 6 24.5     2

Analyse the classes

So here is where I think you’re stuck. The next step is to take those classes and see if they predict our outcome. You need to tell R that the class is a factor, either by using as.factor or by using factor in the regression.

## # A tibble: 4 × 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     24.9        0.208   120.      0    
## 2 factor(class)2  -0.390      0.338    -1.15    0.249
## 3 factor(class)3   0.488      0.305     1.60    0.109
## 4 genderMale      -0.0934     0.252    -0.370   0.711

Nothing happening, although it looks like maybe class 3 is different from class 1. We can also check the model fit.

glance(reg_model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   0.00540       0.00241  3.99      1.80   0.145     3 -2801. 5612. 5636.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

You also asked about plotting. You would do all of your standard plots to check for model fit, like looking at the residuals, etc. I’m not going to run this because I don’t want to add a bunch of plots to this document, but you can do it like this:

plot(reg_model)

When it comes to displaying the results, a table is usually best. But you can also plot the coefficients so you can see how they differ.

regression_tidy <- tidy(reg_model, conf.int = TRUE) %>% 
  filter(term != "(Intercept)")

# Plot the coefficients
ggplot(regression_tidy, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Regression Coefficients with Confidence Intervals",
       x = "Predictor",
       y = "Estimate") +
  theme_minimal() +
  coord_flip()

unnamed-chunk-4-1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment