ESTG - Mestrado em Ciência de Dados



Recent Submissions

Now showing 1 - 10 of 47
  • Day-Ahead Energy Consumption Forecasting in Academic Campi with Deep Learning Models
    Publication . Horta, Simão Matos Conceição; Grilo, Carlos Fernando de Almeida; Távora, Luís Miguel de Oliveira Pegado de Noronha e; Sousa, João Miguel Charrua de; Marques, Pedro José Franco
    Electricity is essential in today’s society, with global demand projected to increase by 50% by 2050. However, power systems suffer significant inefficiencies throughout production, transmission and distribution. Accurate energy consumption forecasting is therefore important for an efficient energy supply, since it makes it possible to deliver only the necessary resources. Advances in data science have provided essential tools to address these challenges by improving the reliability of energy consumption forecasts. Building on these principles, the current project developed a daily predictive model with a one-day horizon to support energy management decisions, with the goal of optimizing energy efficiency at Campus 2 of the Polytechnic Institute of Leiria (IPL). For this purpose, the campus’s historical energy consumption was used to optimize several models through the NeuralForecast framework. Models were developed using only the endogenous consumption variable, as well as models incorporating the exogenous variables Day Type (which distinguishes between weekdays, Saturdays, and Sundays/holidays) and Academic Calendar (which distinguishes between classes, evaluations, school breaks, and vacation periods). The predictive performance of each model was evaluated using the Mean Absolute Percentage Error (MAPE), and its computational cost was measured by the Total Parameter Count (TPC). The results show that the best-performing model was N-BEATSx trained with both exogenous variables, achieving a MAPE of 4.92% with a total of 806k parameters. However, the model that best balances complexity and performance is the MultiLayer Perceptron (MLP) with both exogenous variables, with only around 51k parameters and a MAPE of 5.57%. In summary, this study developed robust neural networks for energy consumption forecasting, providing both theoretical and practical advances to support decision-making aimed at optimizing energy efficiency.
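    A minimal sketch of this setup with the NeuralForecast framework, using NBEATSx and MLP with a one-day horizon. The synthetic stand-in data, the Day Type encoding, and all hyperparameter values are illustrative assumptions, not the dissertation's configuration; only one exogenous variable is included for brevity.
    ```python
    import numpy as np
    import pandas as pd
    from neuralforecast import NeuralForecast
    from neuralforecast.models import NBEATSx, MLP

    # Synthetic stand-in for the campus data: higher consumption on weekdays.
    dates = pd.date_range("2023-01-01", periods=365, freq="D")
    df = pd.DataFrame({
        "unique_id": "campus2",
        "ds": dates,
        "y": 100 + 25 * (dates.dayofweek < 5) + np.random.normal(0, 5, 365),
        # Day Type: 0 = weekday, 1 = Saturday, 2 = Sunday/holiday.
        "day_type": np.where(dates.dayofweek < 5, 0,
                             np.where(dates.dayofweek == 5, 1, 2)),
    })

    models = [
        NBEATSx(h=1, input_size=14, futr_exog_list=["day_type"], max_steps=100),
        MLP(h=1, input_size=14, futr_exog_list=["day_type"], max_steps=100),
    ]
    nf = NeuralForecast(models=models, freq="D")
    nf.fit(df=df)

    # Future exogenous values must be supplied for the forecast day.
    futr = pd.DataFrame({"unique_id": "campus2",
                         "ds": [dates[-1] + pd.Timedelta(days=1)],
                         "day_type": [0]})
    print(nf.predict(futr_df=futr))
    ```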
  • Process Mining and Power BI for KPI Monitoring in Higher Education Institutions
    Publication . Balitkaia, Valeria; Malheiro, Ricardo Manuel da Silva
    The increasing digitalization of administrative activities in Higher Education Institutions (HEIs) has generated large volumes of process data stored in several information systems. While these systems ensure traceability and accessibility of information, they often lack systematic monitoring of business process performance, making it difficult to identify inefficiencies, deviations or bottlenecks. This gap hinders data-driven decision-making by stakeholders and limits the opportunities for continuous improvement. This work addressed the problem of limited visibility over the execution and efficiency of business processes within an HEI by applying Process Mining techniques combined with Business Intelligence (BI) visualizations. The aim was to extract, process and analyze process-related data from the institution’s management information systems, identify relevant performance indicators and provide decision-makers with actionable insights through interactive dashboards. The solution involved defining process Key Performance Indicators (KPIs), grouped into temporal, volume-based, cost-efficiency and compliance categories, the latter measuring whether the real execution of a process followed the intended process model. Using Process Mining, the process metrics underlying these KPIs were discovered and calculated, enabling the identification of deviations and potential optimization points. Beforehand, a validation session with institutional stakeholders was held, in which the practical value of the KPIs for supporting operational and strategic decisions was confirmed. Additionally, the process mining analysis revealed patterns that were previously unknown to managers, reinforcing the benefits of integrating analytical techniques into daily process monitoring. For visualization purposes, three interactive dashboards were developed in Microsoft Power BI, presenting process execution times, workload distribution, cost indicators and other relevant metrics. The developed solution successfully provided a clear, data-driven overview of the institution’s business processes, highlighting areas of inefficiency, such as excessive idle times that affect the cost of each process, and frequent process deviations. In conclusion, the project demonstrated that the combination of Process Mining and Business Intelligence tools can effectively enhance process transparency and performance monitoring in HEIs.
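    As a sketch of how such temporal and compliance KPIs can be computed with an open-source Process Mining toolkit: the abstract does not name one, so pm4py is an assumption here, as are the log file name and format.
    ```python
    import pm4py

    # Hypothetical event log exported from the institution's information systems.
    log = pm4py.read_xes("hei_requests.xes")

    # Temporal KPI: duration of each case (one process execution), in seconds.
    durations = pm4py.get_all_case_durations(log)
    print(f"mean case duration: {sum(durations) / len(durations) / 3600:.1f} h")

    # Compliance KPI: how well real executions fit a model discovered from the log.
    net, im, fm = pm4py.discover_petri_net_inductive(log)
    fitness = pm4py.fitness_token_based_replay(log, net, im, fm)
    print(f"log fitness: {fitness['log_fitness']:.2f}")
    ```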
  • Injection Molding Process Monitoring Based on Machine Learning Algorithms
    Publication . Costa, Pedro Alexandre da Ponte; Mendes, Sílvio Priem; Loureiro, Paulo Jorge Gonçalves
    This thesis addresses the detection of non-conformities in an injection moulding process using unsupervised and semi-supervised learning techniques, with the dual objective of identifying faulty production time windows in which non-conforming parts are produced and enhancing process explainability for machine operators. The early detection of machine parameter deviations, such as abnormal temperature, pressure, torque, and cycle time variations, is essential to minimize economic losses, material waste, and sub-optimal product quality, while keeping the machine in an optimal production state for longer periods. A diverse array of methods was tested to achieve this goal, including Local Outlier Factor (LOF), Isolation Forest (IF), One-Class Support Vector Machine (One-Class SVM), OPTICS, Mean Shift, and Long Short-Term Memory (LSTM) networks. Extensive feature engineering and selection were performed to balance dimensionality reduction with domain-relevant interpretability, enabling actionable feedback for process optimization. The model pipeline was developed using Python, PyTorch, and Scikit-learn, containerized with Docker and accelerated with CUDA. Real-world sensor data spanning six months of continuous 24/7 operation were used for training and evaluation. The proposed LSTM-based approach, designed for time series modeling, achieved a weighted average F1-score of 0.94 on test data in predicting faulty production time windows of approximately seven minutes. Evaluation metrics included accuracy, precision, and recall, with particular emphasis on the F1-score due to the imbalanced nature of the dataset and the critical need to minimize both false positives and false negatives. A key aspect of this work lies in its commitment to data-driven development, grounded in the use of real, unfiltered sensor data from live industrial production. While raw data introduces noise and operational variability, it also provides a more faithful representation of the production environment, revealing edge cases and failure modes often absent in curated datasets. Addressing these challenges required robust preprocessing, careful validation, and a strong understanding of the process domain, ultimately enabling the development of a model better suited for real-world deployment and generalization. The results demonstrate that this methodology provides a robust baseline for anomaly detection and process monitoring in injection moulding. Contributions include a reproducible framework for explainable unsupervised anomaly detection, validation of the LSTM’s effectiveness over static models, and novel feature reduction strategies, such as applying PCA to sensor groups while preserving domain interpretability. This work serves both as a practical tool for deployment and as a methodological reference and compendium of techniques for future research in data-driven industrial process optimization.
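    A minimal PyTorch sketch of the LSTM classifier family used here; the feature count, window length, hidden size, and binary normal/faulty labelling are illustrative assumptions rather than the thesis's actual architecture.
    ```python
    import torch
    import torch.nn as nn

    # Windows of multivariate sensor readings (temperature, pressure, torque,
    # cycle time, ...) classified as normal vs. faulty production windows.
    class SensorLSTM(nn.Module):
        def __init__(self, n_features: int, hidden: int = 64):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)  # 0 = normal, 1 = faulty

        def forward(self, x):                 # x: (batch, time, n_features)
            _, (h_n, _) = self.lstm(x)        # last hidden state summarizes the window
            return self.head(h_n[-1])

    model = SensorLSTM(n_features=12)
    logits = model(torch.randn(8, 420, 12))   # e.g. ~7-minute windows at 1 Hz
    print(logits.shape)                       # (8, 2)
    ```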
  • Anomaly Detection in Numerical Data Based on Benford's Law
    Publication . Martinho, Patrícia Isabel Santos; Santos, Rui Filipe Vargas de Sousa; Antunes, Mário João Gonçalves
    This project focused on anomaly detection through the application of Benford's law, exploring its ability to identify statistical deviations efficiently and accurately. The adopted approach was based on this law, which is widely recognized for its usefulness in fraud detection, especially in financial data, through analysis of the distribution of first digits. The scarcity of quality public data has hindered the rigorous evaluation of statistical models. To overcome this limitation, a parameterizable synthetic data generator was developed, capable of reproducing patterns corresponding both to normal events and to realistic manipulations. The developed application made it possible to simulate diverse conditions and bring the tests closer to real-world situations, facilitating the analysis of the performance and behaviour of the statistical methods. With the simulated data, it became possible to evaluate the effectiveness of different statistical methods under conditions closer to reality. In this context, Benford's law played a central role, standing out for its usefulness in detecting anomalies across multiple scenarios. To explore this capability systematically, a statistical model was developed as an alternative to traditional machine learning models, which exhibit high false-positive rates and heavy computational demands. The proposal rested on applying Benford's law combined with dissimilarity measures, quantifying the deviation between the observed distributions and the distribution expected under the law. Simulations with the developed generator produced datasets that conform and do not conform to Benford's law, thus yielding labelled data. To measure the deviation, the chi-square statistic, the mean absolute deviation, the Kolmogorov–Smirnov test, the Euclidean distance, the Hellinger distance, the Kullback–Leibler divergence, and the combination of the tests' p-values via Fisher's method were used. The performance of the different dissimilarity measures was evaluated using classification metrics such as precision, recall, and F1-score, the same criteria used in machine learning, allowing the model under study to be compared with machine learning models. The analysis was complemented by the confusion matrix and the ROC curve, tools that provide a more detailed evaluation of the model's behaviour.
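    For reference, Benford's law gives first-digit probabilities P(d) = log10(1 + 1/d) for d = 1, ..., 9. A small sketch of the chi-square dissimilarity test described above, on illustrative synthetic data (the project's parameterizable generator is not reproduced here):
    ```python
    import numpy as np
    from scipy.stats import chisquare

    # Benford's expected first-digit probabilities (they sum exactly to 1).
    digits = np.arange(1, 10)
    benford = np.log10(1 + 1 / digits)

    def first_digit_counts(values):
        """Count leading digits of the absolute values (zeros discarded)."""
        leading = np.array([int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0])
        return np.array([(leading == d).sum() for d in digits])

    # Synthetic sample: lognormal data tends to conform to Benford's law.
    observed = first_digit_counts(np.random.lognormal(mean=5, sigma=2, size=5000))
    stat, p = chisquare(observed, f_exp=benford * observed.sum())
    print(f"chi-square = {stat:.1f}, p = {p:.4f}")  # low p suggests a deviation
    ```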
  • Criational: Generation of Song Lyrics with Emotional Context Using Deep Learning Models
    Publication . Agostinho, Mariana Oliveira; Malheiro, Ricardo Manuel da Silva
    Language and music are fundamental human tools for expression, communication, and emotional connection. Music plays a central role in shaping identity, conveying feelings, and promoting social bonds, making it an intriguing domain for technological exploration. Replicating human creativity in music, especially in songwriting, presents a complex challenge, as natural language processing (NLP) and deep learning (DL) must capture both linguistic structure and emotional nuance. This study investigates the generation of emotionally contextualised song lyrics using DL models, including LSTM, GPT-2, and T5, guided by Russell’s Circumplex Model of Emotions. The models were evaluated on readability, coherence, perplexity, structural consistency, thematic alignment, and emotional accuracy. Results show that GPT-2, particularly when fine-tuned, achieves the best balance of coherence and emotional alignment, although it still lacks some musical features such as rhyme and rhythm. LSTM exhibits patterned sequences but high variability, while T5 struggles with structural consistency and repetitive output, highlighting the challenges of small, non-specialised datasets. Overall, the work confirms the feasibility of using DL models as creative support in lyric composition, capable of offering emotionally expressive material to inspire musicians, while also pointing to the need for larger datasets and models tailored to musical structure to achieve fully convincing results.
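    A minimal sketch of emotion-conditioned generation with Hugging Face's GPT-2. The control-token prompt format is an assumption for illustration, not the study's exact conditioning scheme, and the stock "gpt2" weights stand in for the fine-tuned model; under Russell's model, "happy" would sit at positive valence and high arousal.
    ```python
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")  # replace with fine-tuned weights

    # Hypothetical control token a fine-tuned model would have learned.
    prompt = "<emotion=happy> Verse 1:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=60, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    ```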
  • Self-Similarity Matrices and Localized Attention for Chorus Recognition: A Data-Efficient Music Information Retrieval Approach
    Publication . Mena, Jose Daniel Luna; Malheiro, Ricardo Manuel da Silva
    This project presents an efficient approach to chorus recognition in English song lyrics that achieves state-of-the-art performance with significantly fewer resources than existing methods. We developed a Bidirectional Long Short-Term Memory (BiLSTM) model with localized attention mechanisms, trained on only 780 songs compared to the 25,000+ songs typically used in Music Information Retrieval research. Our approach addresses class imbalance through comprehensive stabilization techniques and leverages nine feature views capturing structural, semantic, and rhythmic patterns via self-similarity matrices. Through systematic experimentation, we demonstrate that chorus detection relies primarily on local contextual patterns rather than global structural awareness, with head self-similarity features (line beginnings) proving most critical for segmentation. The BiLSTM + Attention model achieves 78.2% Macro F1 at the line level, matching Watanabe & Goto's (2020) performance with 100,000+ songs and significantly exceeding Fell et al.'s (2018) 67.4% F1 with 25,000 songs. For boundary detection, the model achieves 59.6% F1 for exact boundaries and 74.7% F1 with ±2 tolerance. The research demonstrates that strategic data curation, comprehensive feature engineering, and targeted optimization can compete effectively with resource-intensive approaches, showing that local pattern recognition outperforms complex global modeling strategies in specialized domains like lyric analysis.
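    A sketch of one such self-similarity view: each lyric line becomes a TF-IDF vector, and the matrix of pairwise cosine similarities exposes repeated (chorus) lines as off-diagonal stripes. The toy lyrics are illustrative, and the project's nine feature views (including the head variant over line beginnings) are richer than this single view.
    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    lyrics = ["we sing it loud", "verse line one", "verse line two",
              "we sing it loud", "bridge before the end", "we sing it loud"]
    tfidf = TfidfVectorizer().fit_transform(lyrics)   # one vector per line
    ssm = cosine_similarity(tfidf)                    # (n_lines, n_lines) matrix
    print(np.round(ssm, 2))  # 1.0 off the diagonal marks repeated chorus lines
    ```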
  • The Specialist vs. The Generalist: A Comparative Analysis of Performance and Explainability for Financial Sentiment Classification
    Publication . Roque, Miguel Augusto; Miragaia, Rolando Lúcio Germano; Grilo, Carlos Fernando de Almeida
    The accurate and transparent classification of sentiment in financial texts is a cornerstone of computational finance. This field is currently at a methodological crossroads, dominated by two paradigms: the fine-tuned specialist, represented by domain-adapted models like FinBERT, and the instructed generalist, embodied by modern Large Language Models (LLMs) like Google's Gemini. While performance benchmarks are emerging, a significant research gap exists in the systematic comparison of their performance trade-offs and the nature of their explainability. This dissertation conducts a comparative study between a fine-tuned FinBERT model and the Gemini 2.5 Pro LLM on an extended version of the Financial PhraseBank dataset. The analysis is performed along two axes: (1) Classification Performance, evaluated via metrics robust to class imbalance, and (2) Explainability, where FinBERT's predictions are analyzed using SHapley Additive exPlanations (SHAP). For Gemini, two distinct prompting protocols are compared: a two-step Separated Protocol designed to rigorously test the "overthinking" hypothesis and a single-step Simultaneous Protocol. The results reveal a nuanced performance verdict. While FinBERT excels in accuracy, a key finding is that both Gemini protocols achieve virtually identical performance, challenging the initial "overthinking" hypothesis and suggesting a high degree of robustness in modern LLMs. The qualitative analysis uncovers two distinct reasoning styles: FinBERT's logic is bottom-up and pattern-based, excelling at domain-specific jargon, while Gemini's is top-down and conceptual, grasping holistic meaning but failing on specialized idioms. Ultimately, this work concludes that the choice between a specialist and a generalist is not one of absolute superiority, but a strategic trade-off between accuracy, risk sensitivity, implementation cost, and the desired nature of explainability. This dissertation provides a comprehensive framework for navigating that trade-off.
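    A minimal sketch of the specialist side of the comparison: SHAP token attributions over a FinBERT sentiment pipeline. The checkpoint and example sentence are assumptions; SHAP can wrap a Hugging Face text-classification pipeline directly.
    ```python
    import shap
    from transformers import pipeline

    # top_k=None returns scores for all three classes (positive/negative/neutral).
    clf = pipeline("text-classification", model="ProsusAI/finbert", top_k=None)
    explainer = shap.Explainer(clf)
    sv = explainer(["Operating profit rose to EUR 13.1 mn from EUR 8.7 mn."])
    print(sv[0])  # per-token contributions to each sentiment class
    ```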
  • Analysis of the Real Impact of Social Media and Online Reputation to Improve Marketing Strategies in a Hotel Chain
    Publication . Berrazueta, Juan Andres Coba; Craveiro, Olga Marina Freitas; Sousa, Márcia Cristina Santos Viegas
    The main objective of this research is to design and implement a comprehensive framework that integrates text mining, sentiment analysis, and Business Intelligence (BI) for the analysis of hotel reviews. The study aims to provide hotel managers with a systematic and automated tool capable of transforming unstructured textual data into actionable insights that improve customer satisfaction, enhance online reputation, and support data-driven marketing and operational strategies. This thesis investigates the integration of sentiment analysis, text mining, and BI frameworks as a strategic tool for online reputation management in the hospitality industry. The study combines a systematic literature review, conducted under the PRISMA guidelines, with an empirical project developed according to the CRISP-DM process model. The dataset used comprises all the positive and negative reviews from multiple sources, including Google Reviews, Booking.com, Tripadvisor, and physical surveys, covering five hotels in Portugal during 2023 and 2024. The methodology involved a pipeline of data preparation, including cleaning, deduplication, translation into European Portuguese, normalization, stemming, and lemmatization. Supervised machine learning models, particularly Logistic Regression and Naive Bayes, were implemented and optimized through techniques such as SMOTE and threshold adjustment, demonstrating high accuracy and strong recall for negative comments. Additionally, topic modeling (LDA and NMF) and semantic categorization were applied to extract latent themes and classify reviews into business-relevant categories. Results were operationalized through interactive dashboards in Power BI, which enabled the visualization of satisfaction levels, temporal trends, word frequencies, and category distributions across hotels. These dashboards provided hotel managers with actionable insights to detect strengths, weaknesses, and seasonal patterns in customer perception. The system was further enhanced with an automated scraping pipeline for Google Reviews, ensuring continuous integration of updated customer feedback. The findings confirm that sentiment analysis and BI tools represent a powerful combination for transforming unstructured textual data into actionable insights. The study demonstrates the feasibility, scalability, and strategic relevance of this approach, while also highlighting limitations related to data availability and semantic overlaps. Ultimately, this work contributes to advancing data-driven decision-making in the hospitality industry.
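    A sketch of the supervised route described above: TF-IDF features, SMOTE rebalancing inside an imbalanced-learn pipeline, logistic regression, and a lowered threshold to favour recall on negative reviews. The toy reviews (in English rather than European Portuguese) and the threshold value are illustrative assumptions.
    ```python
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy data: 1 = positive review, 0 = negative review (minority class).
    reviews = ["great stay, lovely staff", "perfect location, spotless room",
               "wonderful breakfast and pool", "friendly staff, would return",
               "room was dirty and noisy", "rude reception, never again"]
    labels = np.array([1, 1, 1, 1, 0, 0])

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("smote", SMOTE(k_neighbors=1, random_state=42)),  # oversample negatives
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(reviews, labels)

    # Lowered threshold: flag a review as negative more readily (recall first).
    proba_neg = pipe.predict_proba(["the shower was broken"])[:, 0]
    print(proba_neg >= 0.35)  # True -> route to complaint handling
    ```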
  • Towards Efficient Classification of Gene Expression Data with Machine Learning
    Publication . Febra, José Leonel de Sousa; Grilo, Carlos Fernando de Almeida; Faria, Paula Cristina Rodrigues Pascoal; Menezes, João Pedro Almeida
    The growing availability of gene expression datasets offers new opportunities for applying machine learning to biological classification. These datasets are typically high-dimensional, limited in sample size, and experimentally diverse, posing both computational and biological challenges. This dissertation investigates how deep learning and classical machine learning models can classify gene expression profiles while evaluating the impact of reducing experimental and computational complexity, thereby lowering associated costs. Three datasets were analysed: GSE3406, with temporal profiles of Saccharomyces species under stress; GSE1723, profiling S. cerevisiae under nutrient limitation and oxygen variation; and GSE6186, recording temporal expression during Drosophila melanogaster embryogenesis. Four models were compared — convolutional neural networks (CNNs), long short-term memory networks (LSTMs), support vector machines (SVMs), and XGBoost — with hyperparameters optimised via the Optuna library and performance assessed through repeated experiments. Results show that CNNs achieved the best performance in GSE3406, LSTMs were slightly superior in GSE6186, and CNNs and XGBoost performed competitively in GSE1723. Comparable accuracy was often obtained under reduced experimental conditions, such as subsets of stimuli, nutrient regimes, or time points. Additionally, gene-level consistency analysis in GSE3406 identified genes consistently well or poorly classified, supporting dimensionality reduction and biological interpretation. This work demonstrates the potential of deep learning for the classification of gene expression profiles, proposing strategies to simplify experimental design without compromising predictive performance.
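    A sketch of the Optuna-driven search for one of the four models (XGBoost); the synthetic high-dimensional, small-sample matrix mimics the gene-expression setting, and the search space is an illustrative assumption.
    ```python
    import optuna
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    # Synthetic stand-in: 60 samples x 500 "genes", as in small expression studies.
    X, y = make_classification(n_samples=60, n_features=500, n_informative=20,
                               random_state=0)

    def objective(trial):
        model = XGBClassifier(
            n_estimators=trial.suggest_int("n_estimators", 50, 500),
            max_depth=trial.suggest_int("max_depth", 2, 8),
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        )
        return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)
    ```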