tarensanders/Example_LCA.md

## Example_LCA.md

      
    Raw
  

              Example_LCA.md
            
          
    Example regression of classes

Create Example Data

I’ll make some dummy data to show how to do this. Ignore how ludicruous
this data is. I’m going to make it so that there are 4 classes,
otherwise we might get convergence issues when we try to pull these
classes. In this data I’m going to use BMI as the outcome variable, and
we’ll also try inluding gender in the regression.
library(dplyr)
library(poLCA)
library(broom)
library(ggplot2)
set.seed(123)

# Data params
n <- 1000
latent_classes <- 4

# Create a data frame with class assignments
dummy_data <- data.frame(
  id = 1:n,
  latent_class = sample(1:latent_classes, n, replace = TRUE)
)

# Define probabilities for sports participation for each class at each time point
prob_matrix <- matrix(c(
  0.1, 0.1, 0.1, 0.1, 0.6,  # Class 1 - late starters
  0.8, 0.8, 0.8, 0.8, 0.8,  # Class 2 - continuous
  0.1, 0.1, 0.1, 0.1, 0.1,  # Class 3 - low participation
  0.1, 0.2, 0.6, 0.7, 0.8   # Class 4 - increasing
), nrow = latent_classes, byrow = TRUE)

# Generate sports participation data based on class probabilities
for (time_point in 1:5) {
  col_name <- paste0("sports_time", time_point)
  dummy_data[[col_name]] <- apply(dummy_data, 1, function(row) {
    class <- as.numeric(row["latent_class"])
    prob <- prob_matrix[class, min(time_point, ncol(prob_matrix))]
    rbinom(1, 1, prob) + 1
  })
}

# Add gender and BMI variables
dummy_data <- dummy_data %>%
  mutate(
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    BMI = round(rnorm(n, mean = 25, sd = 4), 1)
  )

# Remove the class variable since that's what we're trying to predict
dummy_data <- dummy_data %>%
  dplyr::select(-latent_class)

head(dummy_data)
##   id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1  1            1            1            1            1            1   Male
## 2  2            1            1            2            1            2 Female
## 3  3            1            1            1            1            1 Female
## 4  4            1            2            2            1            2 Female
## 5  5            1            1            1            1            1 Female
## 6  6            2            2            2            2            2 Female
##    BMI
## 1 30.9
## 2 19.4
## 3 17.5
## 4 23.9
## 5 26.7
## 6 24.5

Create the classes

I don’t recall what package you were using to get classes, but they all
work more or less the same.
# Specify the formula for the LCA model
formula <- cbind(sports_time1, sports_time2, sports_time3, sports_time4, sports_time5) ~ 1

# Fit LCA models with different numbers of classes
lca_model_2 <- poLCA(formula, data = dummy_data, nclass = 2, maxiter = 1000)
lca_model_3 <- poLCA(formula, data = dummy_data, nclass = 3, maxiter = 1000)
lca_model_4 <- poLCA(formula, data = dummy_data, nclass = 4, maxiter = 1000)
lca_model_5 <- poLCA(formula, data = dummy_data, nclass = 5, maxiter = 1000)
bic_values <- c(lca_model_2$bic, lca_model_3$bic, lca_model_4$bic, lca_model_5$bic)
names(bic_values) <- c("2 Classes", "3 Classes", "4 Classes", "5 Classes")
bic_values
## 2 Classes 3 Classes 4 Classes 5 Classes 
##  5921.356  5873.820  5913.720  5954.247

Somehow I still ended up with 3 classes being best 🙅. Whatever,
we’ll do three then.
So we have our model that ‘predicts’ what class each person is in. We
need to add that class back into our dataframe.
dummy_data$class <- lca_model_3$predclass
head(dummy_data)
##   id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1  1            1            1            1            1            1   Male
## 2  2            1            1            2            1            2 Female
## 3  3            1            1            1            1            1 Female
## 4  4            1            2            2            1            2 Female
## 5  5            1            1            1            1            1 Female
## 6  6            2            2            2            2            2 Female
##    BMI class
## 1 30.9     1
## 2 19.4     1
## 3 17.5     1
## 4 23.9     1
## 5 26.7     1
## 6 24.5     2

Analyse the classes

So here is where I think you’re stuck. The next step is to take those
classes and see if they predict our outcome. You need to tell R that the
class is a factor, either by using as.factor or by using factor in
the regression.
## # A tibble: 4 × 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     24.9        0.208   120.      0    
## 2 factor(class)2  -0.390      0.338    -1.15    0.249
## 3 factor(class)3   0.488      0.305     1.60    0.109
## 4 genderMale      -0.0934     0.252    -0.370   0.711

Nothing happening, although it looks like maybe class 3 is different
from class 1. We can also check the model fit.
glance(reg_model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   0.00540       0.00241  3.99      1.80   0.145     3 -2801. 5612. 5636.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

You also asked about plotting. You would do all of your standard plots
to check for model fit, like looking at the residuals, etc. I’m not
going to run this because I don’t want to add a bunch of plots to this
document, but you can do it like this:
plot(reg_model)
When it comes to displaying the results, a table is usually best. But
you can also plot the coefficients so you can see how they differ.
regression_tidy <- tidy(reg_model, conf.int = TRUE) %>% 
  filter(term != "(Intercept)")

# Plot the coefficients
ggplot(regression_tidy, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Regression Coefficients with Confidence Intervals",
       x = "Predictor",
       y = "Estimate") +
  theme_minimal() +
  coord_flip()