I’ll make some dummy data to show how to do this. Ignore how ludicruous this data is. I’m going to make it so that there are 4 classes, otherwise we might get convergence issues when we try to pull these classes. In this data I’m going to use BMI as the outcome variable, and we’ll also try inluding gender in the regression.
library(dplyr)
library(poLCA)
library(broom)
library(ggplot2)
set.seed(123)
# Data params
n <- 1000
latent_classes <- 4
# Create a data frame with class assignments
dummy_data <- data.frame(
id = 1:n,
latent_class = sample(1:latent_classes, n, replace = TRUE)
)
# Define probabilities for sports participation for each class at each time point
prob_matrix <- matrix(c(
0.1, 0.1, 0.1, 0.1, 0.6, # Class 1 - late starters
0.8, 0.8, 0.8, 0.8, 0.8, # Class 2 - continuous
0.1, 0.1, 0.1, 0.1, 0.1, # Class 3 - low participation
0.1, 0.2, 0.6, 0.7, 0.8 # Class 4 - increasing
), nrow = latent_classes, byrow = TRUE)
# Generate sports participation data based on class probabilities
for (time_point in 1:5) {
col_name <- paste0("sports_time", time_point)
dummy_data[[col_name]] <- apply(dummy_data, 1, function(row) {
class <- as.numeric(row["latent_class"])
prob <- prob_matrix[class, min(time_point, ncol(prob_matrix))]
rbinom(1, 1, prob) + 1
})
}
# Add gender and BMI variables
dummy_data <- dummy_data %>%
mutate(
gender = sample(c("Male", "Female"), n, replace = TRUE),
BMI = round(rnorm(n, mean = 25, sd = 4), 1)
)
# Remove the class variable since that's what we're trying to predict
dummy_data <- dummy_data %>%
dplyr::select(-latent_class)
head(dummy_data)
## id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1 1 1 1 1 1 1 Male
## 2 2 1 1 2 1 2 Female
## 3 3 1 1 1 1 1 Female
## 4 4 1 2 2 1 2 Female
## 5 5 1 1 1 1 1 Female
## 6 6 2 2 2 2 2 Female
## BMI
## 1 30.9
## 2 19.4
## 3 17.5
## 4 23.9
## 5 26.7
## 6 24.5
I don’t recall what package you were using to get classes, but they all work more or less the same.
# Specify the formula for the LCA model
formula <- cbind(sports_time1, sports_time2, sports_time3, sports_time4, sports_time5) ~ 1
# Fit LCA models with different numbers of classes
lca_model_2 <- poLCA(formula, data = dummy_data, nclass = 2, maxiter = 1000)
lca_model_3 <- poLCA(formula, data = dummy_data, nclass = 3, maxiter = 1000)
lca_model_4 <- poLCA(formula, data = dummy_data, nclass = 4, maxiter = 1000)
lca_model_5 <- poLCA(formula, data = dummy_data, nclass = 5, maxiter = 1000)
bic_values <- c(lca_model_2$bic, lca_model_3$bic, lca_model_4$bic, lca_model_5$bic)
names(bic_values) <- c("2 Classes", "3 Classes", "4 Classes", "5 Classes")
bic_values
## 2 Classes 3 Classes 4 Classes 5 Classes
## 5921.356 5873.820 5913.720 5954.247
Somehow I still ended up with 3 classes being best 🙅. Whatever, we’ll do three then.
So we have our model that ‘predicts’ what class each person is in. We need to add that class back into our dataframe.
dummy_data$class <- lca_model_3$predclass
head(dummy_data)
## id sports_time1 sports_time2 sports_time3 sports_time4 sports_time5 gender
## 1 1 1 1 1 1 1 Male
## 2 2 1 1 2 1 2 Female
## 3 3 1 1 1 1 1 Female
## 4 4 1 2 2 1 2 Female
## 5 5 1 1 1 1 1 Female
## 6 6 2 2 2 2 2 Female
## BMI class
## 1 30.9 1
## 2 19.4 1
## 3 17.5 1
## 4 23.9 1
## 5 26.7 1
## 6 24.5 2
So here is where I think you’re stuck. The next step is to take those
classes and see if they predict our outcome. You need to tell R that the
class is a factor, either by using as.factor
or by using factor
in
the regression.
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 24.9 0.208 120. 0
## 2 factor(class)2 -0.390 0.338 -1.15 0.249
## 3 factor(class)3 0.488 0.305 1.60 0.109
## 4 genderMale -0.0934 0.252 -0.370 0.711
Nothing happening, although it looks like maybe class 3 is different from class 1. We can also check the model fit.
glance(reg_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00540 0.00241 3.99 1.80 0.145 3 -2801. 5612. 5636.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
You also asked about plotting. You would do all of your standard plots to check for model fit, like looking at the residuals, etc. I’m not going to run this because I don’t want to add a bunch of plots to this document, but you can do it like this:
plot(reg_model)
When it comes to displaying the results, a table is usually best. But you can also plot the coefficients so you can see how they differ.
regression_tidy <- tidy(reg_model, conf.int = TRUE) %>%
filter(term != "(Intercept)")
# Plot the coefficients
ggplot(regression_tidy, aes(x = term, y = estimate)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
labs(title = "Regression Coefficients with Confidence Intervals",
x = "Predictor",
y = "Estimate") +
theme_minimal() +
coord_flip()