Name
#132 Defining Disease Subtypes: A Generalized, Robust Gene-Clinical Workflow for Endotype Determination and Precision Medicine Stratification.
Content Presented On Behalf Of:
Uniformed Services University
Services/Agencies represented
Other/Not Listed
Session Type
Poster
Date
Tuesday, March 3, 2026
Start Time
5:00 PM
End Time
7:00 PM
Location
Prince Georges Expo Hall E
Focus Areas/Topics
Technology
Learning Outcomes
Outline the critical steps of a multi-stage endotype workflow, including the roles of Consensus Clustering, Rank-Rank Hypergeometric Overlap analysis, and Pathway Analysis in defining biological subtypes.
Evaluate the clinical utility of newly discovered endotypes by correlating molecular signatures with patient outcomes and prognostic factors (e.g., mortality or disease severity).
Select appropriate machine learning techniques (beyond simple feature selection) to integrate and prioritize the predictive value of both molecular and clinical covariates in a final model.
Justify the necessity of external validation in an independent dataset (e.g., microarray data) to ensure the generalizability of the clinical-gene classifier.
Description
The diagnostic challenge posed by complex, heterogeneous human diseases is exacerbated by reliance on broad, symptom-based classifications that fail to account for underlying pathophysiological diversity. To fully realize the promise of precision medicine, there is a critical need to define stable, mechanism-based subtypes of diseases, or endotypes. This study describes a detailed, multi-stage computational and statistical workflow designed to systematically discover, rigorously characterize, and validate molecular endotypes using omics data (such as RNA-sequencing). The analytic process begins with data preprocessing and quality control checks. Batch effects, a common source of bias in omics studies, are mitigated with ComBat-seq. Unsupervised clustering is then performed on the normalized data with ConsensusClusterPlus to identify intrinsic endotype clusters. This method determines the optimal number of endotypes and simultaneously provides a stability metric, which quantifies the reliability of the resulting subtypes. Next, the clinical significance of the endotypes is assessed through a comprehensive Clinical Association Analysis. This involves statistical comparisons of clinically relevant variables between endotypes (using tests such as Kruskal-Wallis or Chi-square) and, critically, assessing differences in patient outcomes (e.g., length of stay, disease severity, or mortality) via appropriate regression techniques, contingent on the availability of such data. To biologically characterize the endotypes, a Differential Expression Analysis, adjusted for multiple comparisons, is conducted to identify gene expression profiles unique to each subtype. A threshold-free comparison of the full transcriptional signatures between endotypes is performed using Rank-Rank Hypergeometric Overlap analysis, which identifies the global pattern of transcriptional similarity or divergence.
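The core idea behind the Rank-Rank Hypergeometric Overlap comparison mentioned above can be sketched in a few lines. This is a minimal illustration, not the workflow's actual implementation: the function names `hypergeom_sf` and `rrho_matrix`, the `step` parameter, and the toy gene labels are assumptions made for the example. It tests only for over-representation and omits the signed ranking and multiple-testing correction that a full analysis would typically include.

```python
from math import comb, log10


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so infeasible terms drop out of the sum automatically.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total


def rrho_matrix(ranked_a, ranked_b, step=1):
    """Return rank cutoffs and a matrix of -log10 overlap p-values.

    ranked_a and ranked_b are two orderings of the same genes (for example,
    sorted by differential expression in two endotype comparisons).
    Entry [r][c] scores the overlap between the top cuts[r] genes of
    ranked_a and the top cuts[c] genes of ranked_b.
    """
    if set(ranked_a) != set(ranked_b):
        raise ValueError("both rankings must contain the same genes")
    n_genes = len(ranked_a)
    cuts = list(range(step, n_genes + 1, step))
    matrix = []
    for i in cuts:
        top_a = set(ranked_a[:i])
        row = []
        for j in cuts:
            overlap = len(top_a & set(ranked_b[:j]))
            p = hypergeom_sf(overlap, n_genes, i, j)
            # Guard against floating-point underflow for very long lists.
            row.append(-log10(p) if p > 0 else float("inf"))
        matrix.append(row)
    return cuts, matrix


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5", "g6"]  # hypothetical gene labels
    cuts, concordant = rrho_matrix(genes, genes)
    _, discordant = rrho_matrix(genes, genes[::-1])
    print(concordant[2][2], discordant[2][2])
```

Because every pair of rank cutoffs is scored, no single differential-expression threshold has to be chosen in advance. Identical rankings give high scores along the diagonal, while reversed rankings give scores near zero for the top-ranked genes.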
This mechanistic understanding is further solidified through Gene Ontology enrichment analysis to functionally define the distinct underlying cellular and molecular pathophysiology of each endotype. The next analytic step is Translational Classifier Development. Feature selection begins by identifying the most differentially expressed genes in pairwise comparisons across the identified endotypes. Elastic net regularization is then applied to the gene expression data to select a parsimonious molecular signature. The resulting streamlined set of high-value genes is then combined with all available clinical covariates to create a comprehensive feature matrix. This integrated set of gene expression and clinical features is subsequently input into a Gradient Boosting Machine (GBM) model, which can capture non-linearities in the data, to predict the identified endotypes. Crucially, the GBM's Feature Importance metric provides an objective ranking of all inputs, allowing researchers to extract an integrated list of molecular and clinical factors that maximize predictive accuracy for clinical translation. Finally, a multinomial model is built using this combined gene-clinical feature set, with performance assessed via cross-validation. The selection of ranked variables for model inclusion can be iterated to minimize the variable set while remaining responsive to the scientific question. The workflow accommodates external validation by applying the classifier to independent omics data, if available, to confirm the existence of the identified endotypes and quantify the classifier’s performance. A dataset of trauma patients with cardiovascular disorder serves as a test case for this workflow. Overall, this generalized approach provides a framework for mechanism-based disease stratification.
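The reason elastic net regularization yields a parsimonious gene signature can be illustrated with its coordinate-wise update. The sketch below is an assumption-laden simplification, not the workflow's code: it uses the Gaussian (squared-error) form with standardized features, whereas the classifier described above predicts endotype classes. The names `soft_threshold`, `elastic_net_update`, and `select_signature` and the gene labels and scores are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_update(z, lam, alpha):
    """One coordinate update for a standardized feature.

    z     : univariate least-squares coefficient of the feature on the
            current partial residual
    lam   : overall penalty strength (lambda)
    alpha : mixing weight; 1.0 is the lasso, 0.0 is ridge
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


def select_signature(z_by_gene, lam, alpha):
    """Keep only genes whose updated coefficient is non-zero."""
    coefs = {g: elastic_net_update(z, lam, alpha) for g, z in z_by_gene.items()}
    return {g: b for g, b in coefs.items() if b != 0.0}


if __name__ == "__main__":
    # Hypothetical per-gene scores, for illustration only.
    z_scores = {"geneA": 0.9, "geneB": 0.15, "geneC": -0.4}
    print(select_signature(z_scores, lam=0.5, alpha=0.5))
```

The L1 part of the penalty (weight `alpha`) sets weakly associated genes exactly to zero, which is what trims the signature. The L2 part (weight `1 - alpha`) shrinks the remaining coefficients and tends to stabilize selection among correlated genes.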