ESTG - Mestrado em Ciência de Dados
Browsing ESTG - Mestrado em Ciência de Dados by advisor "Faria, Paula Cristina Rodrigues Pascoal"
- Towards Efficient Classification of Gene Expression Data with Machine Learning
  Publication. Febra, José Leonel de Sousa; Grilo, Carlos Fernando de Almeida; Faria, Paula Cristina Rodrigues Pascoal; Menezes, João Pedro Almeida
  The growing availability of gene expression datasets offers new opportunities for applying machine learning to biological classification. These datasets are typically high-dimensional, limited in sample size, and experimentally diverse, posing both computational and biological challenges. This dissertation investigates how deep learning and classical machine learning models can classify gene expression profiles, and evaluates the impact of reducing experimental and computational complexity in order to lower the associated costs. Three datasets were analysed: GSE3406, with temporal profiles of Saccharomyces species under stress; GSE1723, profiling S. cerevisiae under nutrient limitation and oxygen variation; and GSE6186, recording temporal expression during Drosophila melanogaster embryogenesis. Four models were compared: convolutional neural networks (CNNs), long short-term memory networks (LSTMs), support vector machines (SVMs), and XGBoost. Hyperparameters were optimised with the Optuna library, and performance was assessed through repeated experiments. Results show that CNNs achieved the best performance on GSE3406, LSTMs were slightly superior on GSE6186, and CNNs and XGBoost performed competitively on GSE1723. Comparable accuracy was often obtained under reduced experimental conditions, such as subsets of stimuli, nutrient regimes, or time points. Additionally, gene-level consistency analysis on GSE3406 identified genes that were consistently well or poorly classified, supporting dimensionality reduction and biological interpretation. This work demonstrates the potential of deep learning for classifying gene expression profiles and proposes strategies to simplify experimental design without compromising predictive performance.
