Machine learning approach to identify significant genes and classify cancer types from RNA-seq data

2025-11-06

Sultana Akter, Ridwan Olamilekan Adesola, Shreya Basnet,
Machine learning approach to identify significant genes and classify cancer types from RNA-seq data,
Global Medical Genetics,
Volume 12, Issue 4,
2025,
100079,
ISSN 2699-9404,
https://doi.org/10.1016/j.gmg.2025.100079.
(https://www.sciencedirect.com/science/article/pii/S2699940425000803)
Abstract: Cancer remains a leading cause of morbidity and mortality worldwide, with nearly 10 million deaths reported in 2022. In the United States, more than 618,000 deaths are projected to occur in 2025. Traditional methods for identifying cancer types are often time-consuming, labor-intensive, and resource-demanding, highlighting the need for efficient alternatives. This study aimed to evaluate machine learning algorithms on RNA-seq gene expression data to identify statistically significant genes and classify cancer types. We retrieved the PANCAN RNA-seq dataset from the UCI Machine Learning Repository and assessed eight classifiers—Support Vector Machines, K-Nearest Neighbors, AdaBoost, Random Forest, Decision Tree, Quadratic Discriminant Analysis, Naïve Bayes, and Artificial Neural Networks. Model performance was validated using a 70/30 train-test split and 5-fold cross-validation. Among the tested models, the Support Vector Machine achieved the highest classification accuracy of 99.87 % under 5-fold cross-validation. These findings demonstrate the potential of machine learning to efficiently analyze RNA-seq data, facilitate biomarker discovery, and support the development of personalized cancer diagnostics and treatment strategies.
Keywords: Cancer; Diagnosis; Machine learning; RNA-seq
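
Illustrative sketch (not the authors' code): the abstract describes validating classifiers with a 70/30 train-test split and 5-fold cross-validation. The scikit-learn example below shows what those two schemes might look like for a Support Vector Machine. Several things here are assumptions rather than details from the paper: the data is a synthetic stand-in generated with make_classification (the PANCAN dataset is not loaded), the sample and feature counts are arbitrary, and the linear kernel, feature scaling, stratified splitting, and random seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for an expression matrix (samples x genes).
# Sizes and class count are arbitrary, not taken from the paper.
X, y = make_classification(
    n_samples=300,
    n_features=200,
    n_informative=20,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Assumption: scaling plus a linear-kernel SVM. Putting the scaler inside the
# pipeline means it is refit on the training portion of every split.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Scheme 1: a single 70/30 hold-out split, stratified by class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# Scheme 2: 5-fold cross-validation over the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"70/30 hold-out accuracy: {holdout_accuracy:.4f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")
```

Other scikit-learn estimators, such as KNeighborsClassifier or RandomForestClassifier, could be substituted for SVC in the pipeline to compare models under the same two schemes. Accuracies on this synthetic data say nothing about the figures reported in the abstract.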