
Biomarker analysis of proteomics data


PLS-DA Analysis for Age Group Comparison

Author: Ivan Osinnii

Date: May 25, 2025

Introduction

This is an intermediate-level R analysis example that applies simple machine learning techniques to biomarker discovery, replicating a Python workflow built on scikit-learn methods. Here we use the R analogues: mixOmics and pls (PLS-DA), glmnet (LASSO), and e1071 (SVM).

The dataset was shared at the “Open Innovations in Life Sciences” workshop in October 2024 and originates from the manuscript “Discovery proteomics in aging human skeletal muscle finds change in spliceosome, immunity, proteostasis and mitochondria”. This study aimed to understand the molecular mechanisms behind the age-related decline in skeletal muscle strength, a primary cause of mobility loss and frailty in older adults.

Quantitative proteomic analysis was performed on muscle specimens collected from 58 healthy individuals, aged 20 to 87 years, after excluding two participants due to insufficient sample size. The participants were distributed across five age groups: 20–34 years (n = 13), 35–49 years (n = 11), 50–64 years (n = 12), 65–79 years (n = 12), and 80+ years (n = 10).

The analysis revealed that proteins related to energetic metabolism, including those associated with the TCA cycle, mitochondrial respiration, and glycolysis, were underrepresented in older individuals, while proteins involved in immune responses, proteostasis, and alternative splicing were overrepresented. These findings suggest that aging muscle is characterized by disrupted metabolic processes, a pro-inflammatory environment, and increased protein degradation. RNA-seq analysis further confirmed changes in alternative splicing with age, implying that the splicing machinery adapts to increased cellular damage in aging muscle.

Loading data and dependencies
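
The chunks below reference report parameters (params$...) and functions from several packages that are assumed to be loaded in a setup chunk not shown in this rendering. A minimal sketch of such a setup, with placeholder values for the parameters, could look like this:

Code
# Assumed setup (sketch, not part of the original report): packages used below
library(tidyverse)   # dplyr, tidyr, ggplot2, readr, ...
library(pls)         # plsr() for PLS regression
library(glmnet)      # LASSO regression
library(e1071)       # svm()
library(caret)       # createDataPartition(), confusionMatrix()
library(gridExtra)   # grid.arrange()
library(mixOmics)    # PLS-DA utilities
library(DT)          # interactive tables

# Report parameters are normally defined in the document's YAML header (params:);
# the names below mirror their later usage, and the values are placeholders.
params <- list(
  data_file   = "data.csv",   # placeholder; the actual file name is not shown
  group1      = "80+",
  group2      = "20-34",
  n_features  = 15,
  lasso_alpha = 0.1,          # assumed value; passed to glmnet as lambda
  random_seed = 42            # assumed value
)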

Code
# Load the data
df <- read.csv(file.path("..", "input", params$data_file), stringsAsFactors = FALSE)

cat("Dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
Dataset dimensions: 58 rows x 2511 columns
Code
# Display first few rows
# head(df) %>% 
#   DT::datatable(options = list(scrollX = TRUE, pageLength = 5))

Exploratory Data Analysis: Unsupervised vs Supervised Dimensionality Reduction

Principal Component Analysis (PCA) serves as the initial exploratory technique, revealing natural clustering patterns in the biomarker data without considering group labels. This unsupervised approach helps identify inherent data structure and potential age-related separation.

Code
# Prepare data for PCA
intensity_data <- df %>% 
  select(-Sample_Name, -Age_Group) %>% 
  mutate_all(as.numeric)

group_info <- df$Age_Group

# Perform PCA
pca_result <- prcomp(intensity_data, scale. = TRUE)
pca_scores <- data.frame(pca_result$x[,1:2], Age_Group = group_info)

# Calculate explained variance
var_explained <- round(summary(pca_result)$importance[2,1:2] * 100, 2)

# Create PCA plot
colors <- c("80+" = "red", "20-34" = "blue", "35-49" = "green", 
           "50-64" = "orange", "65-79" = "purple")

pca_plot <- ggplot(pca_scores, aes(x = PC1, y = PC2, color = Age_Group)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = colors) +
  labs(title = "PCA Plot - All Age Groups",
       x = paste0("PC1 (", var_explained[1], "%)"),
       y = paste0("PC2 (", var_explained[2], "%)")) +
  theme_minimal() +
  theme(legend.position = "right")

print(pca_plot)
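
With roughly 2,500 proteins and 58 samples, PC1 and PC2 capture only part of the total variance, so it can be worth checking how quickly the explained variance accumulates across components. A minimal sketch of such a check (not run in this report):

Code
# Sketch: cumulative percentage of variance explained by the leading components
cum_var <- cumsum(summary(pca_result)$importance[2, ]) * 100
round(head(cum_var, 10), 1)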

PLS-DA Analysis (All Groups)

Partial Least Squares Discriminant Analysis (PLS-DA) follows as a supervised alternative, explicitly incorporating age group information to maximize between-group separation. The addition of 95% confidence ellipses provides statistical context for group boundaries and overlap assessment.

Code
# Remove any rows with missing values
complete_rows <- complete.cases(intensity_data)
intensity_clean <- intensity_data[complete_rows, ]
group_clean <- group_info[complete_rows]

cat("After cleaning - Intensity data:", dim(intensity_clean), "\n")
After cleaning - Intensity data: 58 2509 
Code
cat("After cleaning - Group info length:", length(group_clean), "\n")
After cleaning - Group info length: 58 
Code
X <- as.matrix(intensity_clean)
Y <- as.factor(group_clean)

# Check for constant/zero variance columns
zero_var_cols <- apply(X, 2, var, na.rm = TRUE) == 0
if (sum(zero_var_cols) > 0) {
  cat("Removing", sum(zero_var_cols), "zero variance columns\n")
  X <- X[, !zero_var_cols]
}
# Alternative PLS-DA implementation using pls package
Y_numeric <- as.numeric(Y)
combined_data <- data.frame(Y = Y_numeric, X)

pls_model <- plsr(Y ~ ., data = combined_data, ncomp = 2, validation = "none")
pls_scores <- scores(pls_model)[, 1:2]

# Create plot data
plsda_scores <- data.frame(
  comp1 = pls_scores[, 1],
  comp2 = pls_scores[, 2],
  Age_Group = Y
)

cat("PLS scores dimensions:", dim(plsda_scores), "\n")
PLS scores dimensions: 58 3 
Code
# Create PLS-DA plot with confidence ellipses
plsda_plot <- ggplot(plsda_scores, aes(x = comp1, y = comp2, color = Age_Group)) +
  geom_point(alpha = 0.7, size = 3) +
  stat_ellipse(aes(color = Age_Group), type = "norm", linetype = "dashed", 
               level = 0.95, linewidth = 1) +
  scale_color_manual(values = colors) +
  labs(title = "PLS-DA Plot - All Age Groups",
       x = "PLS Component 1",
       y = "PLS Component 2") +
  theme_minimal() +
  theme(legend.position = "right")

print(plsda_plot)

Code
# Display model summary
summary(pls_model)
Data:   X dimension: 58 2509 
    Y dimension: 58 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
   1 comps  2 comps
X    14.48    29.78
Y    58.36    77.69
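
Note that plsr() here treats the five ordered group labels as a single numeric response, which is a pragmatic stand-in for canonical PLS-DA rather than the standard formulation. Since mixOmics is attached, a more conventional supervised fit that dummy-codes the group factor internally could be obtained along these lines (a sketch, not run in this report):

Code
# Sketch: canonical PLS-DA via mixOmics; Y is dummy-coded internally rather
# than being treated as one numeric response.
plsda_mix <- mixOmics::plsda(X, Y, ncomp = 2)
mix_scores <- data.frame(plsda_mix$variates$X, Age_Group = Y)
# plotIndiv(plsda_mix, group = Y, ellipse = TRUE, legend = TRUE)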

Focusing on Extreme Age Groups

To enhance discriminatory power and simplify the classification problem, the analysis narrows focus to the most physiologically distinct groups: young adults (20-34) versus elderly (80+). This binary comparison maximizes the potential for identifying age-related biomarker patterns while reducing multiclass complexity.

Code
# Create subset with only specified groups
subset_data <- df %>% 
  filter(Age_Group %in% c(params$group1, params$group2))

cat("Subset dimensions:", nrow(subset_data), "rows x", ncol(subset_data), "columns\n")
Subset dimensions: 23 rows x 2511 columns
Code
cat("Groups included:", params$group1, "vs", params$group2, "\n")
Groups included: 80+ vs 20-34 
Code
# Display group counts
table(subset_data$Age_Group)

20-34   80+ 
   13    10 
Code
# Save subset data
# write.csv(subset_data, "31642809_Age_Old_Young.csv", row.names = FALSE)
# cat("Subset saved to: 31642809_Age_Old_Young.csv\n")

PLS-DA for Two-Group Comparison

The binary PLS-DA visualization confirms improved separation between extreme age groups compared to the five-group analysis. This focused approach provides clearer insight into age-related biomarker changes and facilitates more robust statistical modeling.

Code
# Prepare intensity data and group information for subset
subset_intensity <- subset_data %>% 
  select(-Sample_Name, -Age_Group) %>% 
  mutate_all(as.numeric)

subset_groups <- subset_data$Age_Group

# Clean data
complete_rows_subset <- complete.cases(subset_intensity)
subset_intensity_clean <- subset_intensity[complete_rows_subset, ]
subset_groups_clean <- subset_groups[complete_rows_subset]

# Perform PLS-DA on subset using pls package
X_subset <- as.matrix(subset_intensity_clean)
Y_subset_numeric <- as.numeric(as.factor(subset_groups_clean))

# Remove zero variance columns
zero_var_cols_subset <- apply(X_subset, 2, var, na.rm = TRUE) == 0
if (sum(zero_var_cols_subset) > 0) {
  X_subset <- X_subset[, !zero_var_cols_subset]
}

combined_subset_data <- data.frame(Y = Y_subset_numeric, X_subset)
pls_subset_model <- plsr(Y ~ ., data = combined_subset_data, ncomp = 2, validation = "none")
pls_subset_scores <- scores(pls_subset_model)[, 1:2]

# Extract scores for plotting
plsda_subset_scores <- data.frame(
  comp1 = pls_subset_scores[, 1],
  comp2 = pls_subset_scores[, 2],
  Age_Group = subset_groups_clean
)

# Create color map for two groups
two_group_colors <- c("80+" = "red", "20-34" = "skyblue")
used_colors <- two_group_colors[names(two_group_colors) %in% c(params$group1, params$group2)]

# Create PLS-DA plot for two groups
plsda_two_plot <- ggplot(plsda_subset_scores, aes(x = comp1, y = comp2, color = Age_Group)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = used_colors) +
  labs(title = paste("PLS-DA Plot -", params$group1, "vs", params$group2),
       x = "PLS Component 1",
       y = "PLS Component 2") +
  theme_minimal() +
  theme(legend.position = "right")

print(plsda_two_plot)

Code
# Display model summary
summary(pls_subset_model)
Data:   X dimension: 23 2509 
    Y dimension: 23 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
   1 comps  2 comps
X    14.98    40.12
Y    80.15    92.66
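
With only 23 samples and two in-sample components already explaining over 90% of the Y variance, overfitting is a real concern. A leave-one-out cross-validation of the component choice (a sketch, not run in this report) would indicate how many components the data genuinely support:

Code
# Sketch: leave-one-out CV for the two-group PLS model; the RMSEP curve shows
# how prediction error changes as components are added.
pls_cv <- plsr(Y ~ ., data = combined_subset_data, ncomp = 5, validation = "LOO")
plot(RMSEP(pls_cv), legendpos = "topright")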

Feature Selection with LASSO and Biomarker Identification

LASSO (L1-regularized) regression performs automated feature selection, identifying the most discriminative biomarkers while preventing overfitting. The sparse solution highlights 15 key features that contribute most to age group classification, reducing dimensionality from roughly 2,500 proteins to a manageable subset.

Code
# Prepare data for LASSO
X_matrix <- as.matrix(subset_intensity)
y_binary <- ifelse(subset_groups == params$group1, 1, 0)

# Standardize features
X_scaled <- scale(X_matrix)

# Perform LASSO regression
set.seed(params$random_seed)
lasso_model <- glmnet(X_scaled, y_binary, alpha = 1, lambda = params$lasso_alpha)

# Extract coefficients and select top features
lasso_coef <- as.matrix(coef(lasso_model))
lasso_coef_abs <- abs(lasso_coef[-1,])  # Remove intercept
top_features_idx <- order(lasso_coef_abs, decreasing = TRUE)[1:params$n_features]
top_features <- colnames(subset_intensity)[top_features_idx]

cat("Top", params$n_features, "features selected using LASSO:\n")
Top 15 features selected using LASSO:
Code
for(i in 1:length(top_features)) {
  cat(i, ".", top_features[i], "\n")
}
1 . SAMP_HUMAN 
2 . SCOT1_HUMAN 
3 . XRCC6_HUMAN 
4 . SESN1_HUMAN 
5 . ROA3_HUMAN 
6 . DNJA1_HUMAN 
7 . MOC2B_HUMAN 
8 . SMYD5_HUMAN 
9 . RFA1_HUMAN 
10 . HNRPC_HUMAN 
11 . STAU2_HUMAN 
12 . ENOA_HUMAN 
13 . AATC_HUMAN 
14 . MK14_HUMAN 
15 . SCRB2_HUMAN 
Code
# Create coefficient plot
coef_df <- data.frame(
  Feature = names(lasso_coef_abs[top_features_idx]),
  Coefficient = lasso_coef_abs[top_features_idx]
)

coef_plot <- ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = paste("Top", params$n_features, "LASSO Selected Features"),
       x = "Features", y = "Absolute Coefficient") +
  theme_minimal()

print(coef_plot)
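
Two details of this LASSO step are worth keeping in mind: the penalty is fixed by a report parameter rather than tuned, and glmnet is fitted with its default Gaussian family even though the outcome is binary. A common alternative (a sketch under those assumptions, not run in this report) is to let cross-validation choose the penalty in a logistic model:

Code
# Sketch: cross-validated LASSO with a logistic (binomial) model for the
# binary young-vs-old outcome; lambda.min is the penalty minimising CV deviance.
set.seed(params$random_seed)
cv_fit <- cv.glmnet(X_scaled, y_binary, alpha = 1, family = "binomial")
cv_fit$lambda.min
coef(cv_fit, s = "lambda.min")   # sparse coefficients at the selected penalty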

Biomarker Expression Profiling

Individual biomarker visualization through box plots with overlaid jittered points reveals the distribution patterns and expression differences between the young and elderly groups. This detailed examination validates the selected features and provides biological insight into age-related changes at the molecular level.

Code
# Prepare data for plotting
plot_data <- subset_data %>% 
  select(all_of(c("Age_Group", top_features))) %>%
  pivot_longer(cols = -Age_Group, names_to = "Feature", values_to = "Intensity")

# Create boxplots for top features
feature_plots <- list()
n_plots <- min(25, length(top_features))
n_cols <- 5
n_rows <- ceiling(n_plots / n_cols)

for(i in 1:n_plots) {
  feature_name <- top_features[i]
  feature_data <- plot_data %>% filter(Feature == feature_name)
  
  p <- ggplot(feature_data, aes(x = Age_Group, y = Intensity, fill = Age_Group)) +
    geom_boxplot(alpha = 0.7, outlier.shape = NA) +
    geom_jitter(width = 0.2, alpha = 0.7, color = "black") +
    scale_fill_manual(values = used_colors) +
    labs(title = paste("Expression of", feature_name),
         x = "Age Group", y = "Normalized Intensity") +
    theme_minimal() +
    theme(legend.position = "none")
  
  feature_plots[[i]] <- p
}

# Arrange plots in grid
grid.arrange(grobs = feature_plots, ncol = n_cols)
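
To complement the visual inspection, a simple nonparametric test per selected protein (a sketch, not run in this report) would quantify the group differences; p-values should be corrected for multiple testing across the panel:

Code
# Sketch: two-sided Wilcoxon rank-sum tests for the 15 selected proteins,
# with Benjamini-Hochberg adjustment across the panel.
wilcox_p <- sapply(top_features, function(f) {
  wilcox.test(subset_intensity[[f]] ~ subset_groups)$p.value
})
sort(round(p.adjust(wilcox_p, method = "BH"), 4))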

Classification Performance Assessment

The final step evaluates the predictive utility of the selected biomarkers with a Support Vector Machine classifier. The confusion matrix and performance metrics quantify the model’s ability to distinguish between age groups on a held-out test set; because the two-group subset contains only 23 samples, the test set is small and these figures should be read as indicative rather than definitive validation of the biomarker panel.

Code
# Prepare data for classification
X_top_features <- subset_intensity[, top_features]
y_factor <- as.factor(subset_groups)

# Split data into training and testing sets
set.seed(params$random_seed)
train_idx <- createDataPartition(y_factor, p = 0.8, list = FALSE)
X_train <- X_top_features[train_idx, ]
X_test <- X_top_features[-train_idx, ]
y_train <- y_factor[train_idx]
y_test <- y_factor[-train_idx]

# Train SVM classifier
svm_model <- svm(X_train, y_train, kernel = "linear", probability = TRUE)

# Make predictions
y_pred <- predict(svm_model, X_test)

# Create confusion matrix
conf_matrix <- confusionMatrix(y_pred, y_test)
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction 20-34 80+
     20-34     2   0
     80+       0   2
                                     
               Accuracy : 1          
                 95% CI : (0.3976, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 0.0625     
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : 20-34      
                                     
Code
# Plot confusion matrix
conf_df <- as.data.frame(conf_matrix$table)
conf_plot <- ggplot(conf_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 12) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Confusion Matrix for SVM Classifier") +
  theme_minimal()

print(conf_plot)

Code
# Print accuracy
cat("SVM Classifier Accuracy:", round(conf_matrix$overall['Accuracy'] * 100, 2), "%\n")
SVM Classifier Accuracy: 100 %
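
With only four held-out samples, the 100% figure is fragile (note the wide 95% CI in the confusion matrix above), and the features were selected on the full two-group subset, which adds further optimism. A cross-validated estimate (a sketch using e1071's built-in cross argument, not run in this report) would be more informative, ideally with feature selection repeated inside each fold:

Code
# Sketch: 5-fold cross-validation over all 23 samples via svm(..., cross = 5);
# tot.accuracy is the overall cross-validated accuracy in percent.
set.seed(params$random_seed)
svm_cv <- svm(as.matrix(X_top_features), y_factor, kernel = "linear", cross = 5)
svm_cv$tot.accuracy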

Summary Statistics

Code
# Create summary table
summary_stats <- data.frame(
  Metric = c("Total Samples", "Features", "Selected Features", "Training Samples", 
             "Test Samples", "SVM Accuracy"),
  Value = c(nrow(subset_data), ncol(subset_intensity), length(top_features),
            length(y_train), length(y_test), 
            paste0(round(conf_matrix$overall['Accuracy'] * 100, 2), "%"))
)

DT::datatable(summary_stats, options = list(dom = 't'))

Conclusions

The analysis demonstrates that PLS-DA combined with LASSO feature selection can distinguish between young adults (20-34) and elderly individuals (80+) using a reduced panel of 15 discriminative biomarkers, with the SVM classifier reaching 100% accuracy on a small held-out test set (n = 4). Given the cohort size, this estimate should be confirmed by cross-validation or an independent cohort, but the biomarker-based approach provides a useful framework for age-related molecular profiling and could support applications such as age group stratification or aging biomarker discovery.

Session Information

Code
# Display renv project information
cat("=== renv Project Information ===\n")
=== renv Project Information ===
Code
renv::status()
There are no packages installed in the project library.
Use `renv::restore()` to install the packages defined in lockfile.
Code
sessionInfo()
R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=German_Switzerland.utf8  LC_CTYPE=German_Switzerland.utf8   
[3] LC_MONETARY=German_Switzerland.utf8 LC_NUMERIC=C                       
[5] LC_TIME=German_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pls_2.8-5          RColorBrewer_1.1-3 corrplot_0.95      DT_0.33           
 [5] e1071_1.7-16       plotly_4.10.4      caret_7.0-1        glmnet_4.1-8      
 [9] Matrix_1.7-3       gridExtra_2.3      mixOmics_6.32.0    lattice_0.22-6    
[13] MASS_7.3-65        here_1.0.1         lubridate_1.9.4    forcats_1.0.0     
[17] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.4        readr_2.1.5       
[21] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.2      tidyverse_2.0.0   
[25] renv_1.1.4        

loaded via a namespace (and not attached):
 [1] pROC_1.18.5          rlang_1.1.6          magrittr_2.0.3      
 [4] matrixStats_1.5.0    compiler_4.5.0       vctrs_0.6.5         
 [7] reshape2_1.4.4       pkgconfig_2.0.3      shape_1.4.6.1       
[10] fastmap_1.2.0        labeling_0.4.3       rmarkdown_2.29      
[13] prodlim_2024.06.25   tzdb_0.5.0           xfun_0.52           
[16] cachem_1.1.0         jsonlite_2.0.0       recipes_1.3.1       
[19] BiocParallel_1.42.0  parallel_4.5.0       R6_2.6.1            
[22] bslib_0.9.0          stringi_1.8.4        parallelly_1.42.0   
[25] rpart_4.1.24         jquerylib_0.1.4      Rcpp_1.0.14         
[28] iterators_1.0.14     knitr_1.50           future.apply_1.11.3 
[31] pacman_0.5.1         splines_4.5.0        nnet_7.3-20         
[34] igraph_2.1.4         timechange_0.3.0     tidyselect_1.2.1    
[37] yaml_2.3.10          timeDate_4041.110    codetools_0.2-20    
[40] listenv_0.9.1        plyr_1.8.9           withr_3.0.2         
[43] rARPACK_0.11-0       evaluate_1.0.3       future_1.34.0       
[46] survival_3.8-3       proxy_0.4-27         pillar_1.10.2       
[49] BiocManager_1.30.25  foreach_1.5.2        stats4_4.5.0        
[52] ellipse_0.5.0        generics_0.1.4       rprojroot_2.0.4     
[55] hms_1.1.3            scales_1.4.0         globals_0.18.0      
[58] class_7.3-23         glue_1.6.2           lazyeval_0.2.2      
[61] tools_4.5.0          data.table_1.17.2    RSpectra_0.16-2     
[64] ModelMetrics_1.2.2.2 gower_1.0.2          grid_4.5.0          
[67] crosstalk_1.2.1      ipred_0.9-15         nlme_3.1-168        
[70] cli_3.6.5            viridisLite_0.4.2    lava_1.8.1          
[73] corpcor_1.6.10       gtable_0.3.6         sass_0.4.10         
[76] digest_0.6.37        ggrepel_0.9.6        htmlwidgets_1.6.4   
[79] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
[82] hardhat_1.4.1        httr_1.4.7