
ESTG - Mestrado em Ciência de Dados



Recent Submissions

Now showing 1 - 10 of 45
  • Injection Molding Process Monitoring Based on Machine Learning Algorithms
    Publication . Costa, Pedro Alexandre da Ponte; Mendes, Sílvio Priem; Loureiro, Paulo Jorge Gonçalves
    This thesis addresses the detection of non-conformities in an injection moulding process using unsupervised and semi-supervised learning techniques, with the dual objective of identifying faulty production time zones that prompt the creation of non-conforming parts and enhancing process explainability for machine operators. The early detection of machine parameter deviations, such as abnormal temperature, pressure, torque, and cycle time variations, is essential to minimize economic losses, material waste, and sub-optimal product quality, while maintaining the machine in an optimal production state for longer periods. A diverse array of methods, including Local Outlier Factor (LOF), Isolation Forest (IF), One-Class Support Vector Machine (One-Class SVM), OPTICS, Mean Shift, and Long Short-Term Memory (LSTM) networks, were experimented with to achieve this goal. Extensive feature engineering and selection were performed to balance dimensionality reduction with domain-relevant interpretability, enabling actionable feedback for process optimization. The model pipeline was developed using Python, PyTorch, and Scikit-learn, containerized with Docker and accelerated using CUDA. Real-world sensor data spanning six months of continuous 24/7 operation were used for training and evaluation. The proposed LSTM-based approach, designed for time series modeling, achieved a weighted average F1-score of 0.94 on test data in predicting faulty production time zones of approximately seven-minute intervals. Evaluation metrics included accuracy, precision, and recall, with particular emphasis on the F1-score due to the imbalanced nature of the dataset and the critical need to minimize both false positives and false negatives. A key aspect of this work lies in its commitment to data-driven development, grounded in the use of real, unfiltered sensor data from live industrial production. 
While raw data introduces noise and operational variability, it also provides a more faithful representation of the production environment, revealing edge cases and failure modes often absent in curated datasets. Addressing these challenges required robust preprocessing, careful validation, and a strong understanding of the process domain, ultimately enabling the development of a model better suited for real-world deployment and generalization. The results demonstrate that this methodology provides a robust baseline for anomaly detection and process monitoring in injection moulding. Contributions include a reproducible framework for explainable unsupervised anomaly detection, validation of LSTM’s effectiveness over static models, and novel feature reduction strategies, such as applying PCA to sensor groups while preserving domain interpretability. This work serves both as a practical tool for deployment and as a methodological reference and compendium of techniques for future research in data-driven industrial process optimization.
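The window-level deviation flagging described above can be illustrated with a much simpler baseline than the thesis's LSTM: a rolling z-score over a trailing window. This is a minimal sketch, not the thesis's method; the window size and threshold are illustrative assumptions.

```python
import statistics

def flag_anomalies(series, window=20, threshold=3.0):
    """Flag points deviating from the trailing-window mean by more than
    `threshold` standard deviations -- a simple baseline, not the LSTM
    approach used in the thesis."""
    flags = []
    for t, value in enumerate(series):
        if t < window:
            flags.append(False)  # not enough history yet
            continue
        past = series[t - window:t]
        mu = statistics.fmean(past)
        sd = statistics.pstdev(past) or 1e-9  # avoid division by zero
        flags.append(abs(value - mu) / sd > threshold)
    return flags
```

A sequence model like the LSTM used in the thesis improves on this by learning normal temporal patterns rather than assuming a locally stationary mean.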
  • Anomaly Detection in Numerical Data Based on Benford's Law
    Publication . Martinho, Patrícia Isabel Santos; Santos, Rui Filipe Vargas de Sousa; Antunes, Mário João Gonçalves
    This project focused on anomaly detection through the application of Benford's law, exploring its ability to identify statistical deviations efficiently and accurately. The approach adopted was based on this law, widely recognised for its usefulness in fraud detection, especially in financial data, through the analysis of first-digit distributions. The scarcity of high-quality public data made the rigorous evaluation of statistical models difficult. To overcome this limitation, a parameterisable synthetic data generator was developed, capable of reproducing patterns corresponding both to normal events and to realistic manipulations. The application developed made it possible to simulate diverse conditions and bring the tests closer to real-world situations, facilitating the analysis of the performance and behaviour of the statistical methods. With the simulated data obtained, it became possible to evaluate the effectiveness of different statistical methods under conditions closer to reality. In this context, Benford's law played a central role, standing out for its usefulness in detecting anomalies across multiple scenarios. To explore this capability systematically, a statistical model was developed as an alternative to traditional machine learning models, which exhibit high false-positive rates and heavy computational demands. The proposal rested on applying Benford's law combined with dissimilarity measures, making it possible to quantify the deviation between the observed distributions and the distribution expected under this law. Simulations were carried out with the generator to create datasets that conform and do not conform to Benford's law, thereby obtaining labelled data. To measure the deviation, the following were used: the chi-square statistic, the mean absolute deviation, the Kolmogorov–Smirnov test, the Euclidean distance, the Hellinger distance, the Kullback–Leibler divergence, and the combination of the tests' p-values via Fisher's method. The performance of the different dissimilarity measures was evaluated using classification metrics such as precision, recall, and F1-score, the same criteria used in machine learning, allowing the model under study to be compared with machine learning models. The analysis was complemented by the confusion matrix and the ROC curve, tools that enable a more detailed assessment of the model's behaviour.
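The core of the approach, comparing observed first-digit frequencies against the Benford distribution with a deviation statistic, can be sketched as follows. This is a minimal illustration using the chi-square measure; the function names are ours, not the project's.

```python
import math
from collections import Counter

def benford_expected():
    # Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    # Leading significant digit of a nonzero number.
    return int(str(abs(x)).lstrip("0.")[0])

def chi_square_benford(values):
    # Chi-square deviation between observed first-digit counts and the
    # counts expected under Benford's law (one of several dissimilarity
    # measures compared in the project).
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    observed = Counter(digits)
    expected = benford_expected()
    return sum((observed.get(d, 0) - n * expected[d]) ** 2 / (n * expected[d])
               for d in range(1, 10))
```

The other measures studied (Kolmogorov–Smirnov, Hellinger, Kullback–Leibler, etc.) plug into the same skeleton: each replaces the final statistic while keeping the observed-vs-expected comparison.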
  • Criational: Generation of Song Lyrics with Emotional Context Using Deep Learning Models
    Publication . Agostinho, Mariana Oliveira; Malheiro, Ricardo Manuel da Silva
    Language and music are fundamental human tools for expression, communication, and emotional connection. Music plays a central role in shaping identity, conveying feelings, and promoting social bonds, making it an intriguing domain for technological exploration. Replicating human creativity in music, especially in songwriting, presents a complex challenge, as natural language processing (NLP) and deep learning (DL) must capture both linguistic structure and emotional nuance. This study investigates the generation of emotionally contextualised song lyrics using DL models, including LSTM, GPT-2, and T5, guided by Russell’s Circumplex Model of Emotions. The models were evaluated on readability, coherence, perplexity, structural consistency, thematic alignment, and emotional accuracy. Results show that GPT-2, particularly when fine-tuned, achieves the best balance of coherence and emotional alignment, although it still lacks some musical features such as rhyme and rhythm. LSTM exhibits patterned sequences but high variability, while T5 struggles with structural consistency and repetitive output, highlighting the challenges of small, non-specialised datasets. Overall, the work confirms the feasibility of using DL models as creative support in lyric composition, capable of offering emotionally expressive material to inspire musicians, while also pointing to the need for larger datasets and models tailored to musical structure to achieve fully convincing results.
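Among the evaluation metrics listed, perplexity has a compact definition worth making concrete: the exponential of the average negative log-likelihood the model assigns to the generated tokens. A minimal sketch; the per-token probabilities below are illustrative.

```python
import math

def perplexity(token_probs):
    # Perplexity = exp(average negative log-likelihood); lower values mean
    # the model finds the generated sequence more predictable.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

For example, a model assigning probability 0.5 to every token in a lyric has perplexity 2, regardless of sequence length.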
  • Self-Similarity Matrices and Localized Attention for Chorus Recognition: A Data-Efficient Music Information Retrieval Approach
    Publication . Mena, Jose Daniel Luna; Malheiro, Ricardo Manuel da Silva
    This project presents an efficient approach to chorus recognition in English song lyrics that achieves state-of-the-art performance with significantly fewer resources than existing methods. We developed a Bidirectional Long Short-Term Memory (BiLSTM) model with localized attention mechanisms, trained on only 780 songs compared to the 25,000+ songs typically used in Music Information Retrieval research. Our approach addresses class imbalance through comprehensive stabilization techniques and leverages nine feature views capturing structural, semantic, and rhythmic patterns via self-similarity matrices. Through systematic experimentation, we demonstrate that chorus detection relies primarily on local contextual patterns rather than global structural awareness, with head self-similarity features (line beginnings) proving most critical for segmentation. The BiLSTM + Attention model achieves 78.2% Macro F1 at the line level, matching Watanabe & Goto's (2020) performance with 100,000+ songs and significantly exceeding Fell et al.'s (2018) 67.4% F1 with 25,000 songs. For boundary detection, the model achieves 59.6% F1 for exact boundaries and 74.7% F1 with ±2 tolerance. The research demonstrates that strategic data curation, comprehensive feature engineering, and targeted optimization can compete effectively with resource-intensive approaches, showing that local pattern recognition outperforms complex global modeling strategies in specialized domains like lyric analysis.
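A line-level self-similarity matrix of the kind used as input features can be sketched with a simple token-overlap (Jaccard) similarity. This is a stand-in for illustration only; the project actually uses nine distinct feature views.

```python
def jaccard(a, b):
    # Token-overlap similarity between two lyric lines.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def self_similarity_matrix(lines):
    # Entry (i, j) holds the similarity of line i to line j; repeated chorus
    # lines show up as bright off-diagonal blocks in the matrix.
    return [[jaccard(li, lj) for lj in lines] for li in lines]
```

Repeated choruses produce high-similarity blocks away from the diagonal, which is exactly the structure the localized attention mechanism can exploit.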
  • The Specialist vs. The Generalist: A Comparative Analysis of Performance and Explainability for Financial Sentiment Classification
    Publication . Roque, Miguel Augusto; Miragaia, Rolando Lúcio Germano; Grilo, Carlos Fernando de Almeida
    The accurate and transparent classification of sentiment in financial texts is a cornerstone of computational finance. This field is currently at a methodological crossroads, dominated by two paradigms: the fine-tuned specialist, represented by domain-adapted models like FinBERT, and the instructed generalist, embodied by modern Large Language Models (LLMs) like Google's Gemini. While performance benchmarks are emerging, a significant research gap exists in the systematic comparison of their performance trade-offs and the nature of their explainability. This dissertation conducts a comparative study between a fine-tuned FinBERT model and the Gemini 2.5 Pro LLM on an extended version of the Financial PhraseBank dataset. The analysis is performed along two axes: (1) Classification Performance, evaluated via metrics robust to class imbalance, and (2) Explainability, where FinBERT's predictions are analyzed using SHapley Additive exPlanations (SHAP). For Gemini, two distinct prompting protocols are compared: a two-step Separated Protocol designed to rigorously test the "overthinking" hypothesis and a single-step Simultaneous Protocol. The results reveal a nuanced performance verdict. While FinBERT excels in accuracy, a key finding is that both Gemini protocols achieve virtually identical performance, challenging the initial "overthinking" hypothesis and suggesting a high degree of robustness in modern LLMs. The qualitative analysis uncovers two distinct reasoning styles: FinBERT's logic is bottom-up and pattern-based, excelling at domain-specific jargon, while Gemini's is top-down and conceptual, grasping holistic meaning but failing on specialized idioms. Ultimately, this work concludes that the choice between a specialist and a generalist is not one of absolute superiority, but a strategic trade-off between accuracy, risk sensitivity, implementation cost, and the desired nature of explainability. 
This dissertation provides a comprehensive framework for navigating that trade-off.
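The two prompting protocols can be made concrete with a sketch. The wording below is hypothetical; the dissertation's exact prompts are not reproduced here.

```python
def separated_protocol(sentence):
    # Hypothetical two-step protocol: elicit reasoning first, then request
    # the label in a separate turn, isolating any "overthinking" effect.
    step1 = ("Analyse the sentiment drivers in this financial sentence:\n"
             f"{sentence}")
    step2 = ("Based on your analysis, answer with exactly one word: "
             "positive, negative, or neutral.")
    return [step1, step2]

def simultaneous_protocol(sentence):
    # Hypothetical single-step protocol: reasoning and label in one turn.
    return [("Classify this financial sentence as positive, negative, or "
             f"neutral, briefly justifying your answer:\n{sentence}")]
```

That both protocols yielded virtually identical performance is the result challenging the "overthinking" hypothesis.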
  • Analysis of the Real Impact of Social Media and Online Reputation to Improve Marketing Strategies in a Hotel Chain
    Publication . Berrazueta, Juan Andres Coba; Craveiro, Olga Marina Freitas; Sousa, Márcia Cristina Santos Viegas
    The main objective of this research is to design and implement a comprehensive framework that integrates text mining, sentiment analysis, and Business Intelligence (BI) for the analysis of hotel reviews. The study aims to provide hotel managers with a systematic and automated tool capable of transforming unstructured textual data into actionable insights that improve customer satisfaction, enhance online reputation, and support data-driven marketing and operational strategies. This thesis investigates the integration of sentiment analysis, text mining, and BI frameworks as a strategic tool for online reputation management in the hospitality industry. The study combines a systematic literature review, conducted under the PRISMA guidelines, with an empirical project developed according to the CRISP-DM process model. The dataset used comprises all the positive and negative reviews from multiple sources—including Google Reviews, Booking.com, Tripadvisor, and physical surveys—covering five hotels in Portugal during 2023 and 2024. The methodology involved a pipeline of data preparation, including cleaning, deduplication, translation into European Portuguese, normalization, stemming, and lemmatization. Supervised machine learning models, particularly Logistic Regression and Naive Bayes, were implemented and optimized through techniques such as SMOTE and threshold adjustment, demonstrating high accuracy and strong recall for negative comments. Additionally, topic modeling (LDA and NMF) and semantic categorization were applied to extract latent themes and classify reviews into business-relevant categories. Results were operationalized through interactive dashboards in Power BI, which enabled the visualization of satisfaction levels, temporal trends, word frequencies, and category distributions across hotels. These dashboards provided hotel managers with actionable insights to detect strengths, weaknesses, and seasonal patterns in customer perception. 
The system was further enhanced with an automated scraping pipeline for Google Reviews, ensuring continuous integration of updated customer feedback. The findings confirm that sentiment analysis and BI tools represent a powerful combination for transforming unstructured textual data into actionable insights. The study demonstrates the feasibility, scalability, and strategic relevance of this approach, while also highlighting limitations related to data availability and semantic overlaps. Ultimately, this work contributes to advancing data-driven decision-making in the hospitality industry.
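One of the supervised models mentioned, Naive Bayes, is compact enough to sketch end to end. The toy reviews below are illustrative, not drawn from the hotel dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (text, label). Multinomial Naive Bayes with Laplace smoothing.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in label_counts:
        logp = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Laplace-smoothed per-class word likelihood
            logp += math.log((word_counts[label][token] + 1)
                             / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

The thesis's production models additionally use SMOTE and threshold adjustment to boost recall on negative reviews, which matters when missed complaints are costlier than false alarms.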
  • Towards Efficient Classification of Gene Expression Data with Machine Learning
    Publication . Febra, José Leonel de Sousa; Grilo, Carlos Fernando de Almeida; Faria, Paula Cristina Rodrigues Pascoal; Menezes, João Pedro Almeida
    The growing availability of gene expression datasets offers new opportunities for applying machine learning to biological classification. These datasets are typically high-dimensional, limited in sample size, and experimentally diverse, posing both computational and biological challenges. This dissertation investigates how deep learning and classical machine learning models can classify gene expression profiles while evaluating the impact of reducing experimental and computational complexity, thereby lowering associated costs. Three datasets were analysed: GSE3406, with temporal profiles of Saccharomyces species under stress; GSE1723, profiling S. cerevisiae under nutrient limitation and oxygen variation; and GSE6186, recording temporal expression during Drosophila melanogaster embryogenesis. Four models were compared — convolutional neural networks (CNN), long short-term memory networks (LSTMs), support vector machines (SVMs), and XGBoost — with hyperparameters optimised via the Optuna library and performance assessed through repeated experiments. Results show that CNNs achieved the best performance in GSE3406, LSTMs were slightly superior in GSE6186, and CNN and XGBoost performed competitively in GSE1723. Comparable accuracy was often obtained under reduced experimental conditions, such as subsets of stimuli, nutrient regimes, or time points. Additionally, gene-level consistency analysis in GSE3406 identified genes consistently well or poorly classified, supporting dimensionality reduction and biological interpretation. This work demonstrates the potential of deep learning for the classification of gene expression profiles, proposing strategies to simplify experimental design without compromising predictive performance.
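Hyperparameter optimisation with Optuna follows the same contract as any search loop: propose parameters, score them, keep the best. A minimal random-search stand-in (not the Optuna API) looks like this:

```python
import random

def random_search(objective, space, trials=20, seed=0):
    # Minimal random-search sketch: sample hyperparameters from `space`
    # and keep the best-scoring trial. Optuna adds smarter samplers
    # (e.g. TPE) and pruning on top of this same propose-score-keep loop.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In the dissertation's setting, `objective` would train one of the four models with the sampled hyperparameters and return a validation score, with repeated runs averaging out variance.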
  • Multiclass Classification Algorithms via SVM for Diagnostic Support from Cardiopulmonary Exercise Testing (CPET) Data
    Publication . Santos, Flávio Bueno dos; Pinheiro, Rafael Fernandes; Pinto, Rui Manuel da Fonseca
    Cardiopulmonary Exercise Testing (CPET) is a vital tool for functional diagnosis, but the application of machine learning models is often limited by the scarcity of complete time series for diverse clinical conditions. This dissertation addresses this gap, starting from a classification framework based on Support Vector Machines (SVM) and the Discrete Wavelet Transform (DWT), originally developed for three classes (Heart Failure (HF), Metabolic Syndrome (MS), and Healthy (H)). The central objective was to extend this methodology to a more complex five-class scenario through the generation of semi-synthetic data for the Pulmonary Limitation (PL) and Musculoskeletal Limitation (ML) groups, guided by statistical parameters from real patients. Subsequently, to validate the effectiveness of the DWT in this new context, a comparative analysis was conducted, evaluating the model's performance against three alternative feature-extraction methods: the Short-Time Fourier Transform (STFT), the Wavelet Packet Transform (WPT), and Empirical Mode Decomposition (EMD). All models were evaluated under a consistent experimental protocol to ensure a fair comparison. The results of the comparative analysis were consistent. The SVM-Linear-MW5 model, which uses the DWT, achieved an accuracy of 93.60% and an F1-score of 84.14%, outperforming the other transforms. The analysis showed that the STFT was the most competitive alternative (F1-score of 74.25%), while the WPT and EMD proved less effective for this problem. This work concludes that combining semi-synthetic data with DWT-based feature extraction is a viable approach for extending diagnostic models. The reference methodology was extended to five classes and, in the comparative analysis carried out, its signal-processing approach achieved the highest performance among the techniques tested, establishing a solid baseline for future research in the area, including hyperparameter optimisation.
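The DWT feature extraction at the heart of the comparison can be illustrated with its simplest instance, a single level of the Haar wavelet. This is a sketch of the transform itself, not the dissertation's full SVM-Linear-MW5 pipeline.

```python
import math

def haar_dwt_level(signal):
    # One Haar DWT level: scaled pairwise sums give the approximation
    # coefficients (low-frequency trend) and scaled pairwise differences
    # give the detail coefficients (high-frequency content). Signal energy
    # is preserved across the split.
    assert len(signal) % 2 == 0, "need an even-length signal"
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail
```

Applying this recursively to the approximation coefficients yields the multi-level decomposition whose coefficients serve as SVM input features.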
  • Effects of Educational Mismatch on Wages in the Portuguese Labour Market: An Analysis of the Quadros de Pessoal (2010–2023)
    Publication . Oliveira, David Santos; Santos, Rui Filipe Vargas de Sousa; Lopes, Ana Sofia Patrício Pinto
    This study examines the relationship between workers' qualifications and real earnings in the Portuguese labour market, focusing on the phenomenon of educational mismatch, namely the discrepancy between an individual's level of schooling and that required by their occupation, and analyses its wage implications. Using longitudinal data from the Quadros de Pessoal between 2010 and 2023, the research combines two complementary methodological approaches: annual linear regressions based on cross-sectional data, focusing on wage differences between workers, and fixed-effects panel data models, which capture temporal variation and control for unobserved individual heterogeneity. The variables analysed include demographic, human capital, firm-level, regional, sectoral, and contractual factors, benefiting from the use of the Quadros de Pessoal, an administrative database covering all private-sector workers in Portugal. The size of this database makes it possible to work with millions of annual observations, ensuring representativeness as well as great richness in the diversity of available variables. The results show that schooling, educational mismatch, productivity, and firm size are robust wage determinants, while characteristics such as gender exhibit more complex effects that depend on the model estimated. The comparison between methods shows that the panel model tends to reduce the magnitude of the coefficients. This difference stems not only from controlling for unobserved heterogeneity through fixed effects, but also from the fact that the panel captures the evolution of the same individual's earnings over time, providing a dynamic analysis of the relationships between variables. This work contributes to a deeper understanding of wage inequalities associated with educational mismatch in Portugal, providing empirical evidence useful for designing public policies that promote a better match between qualifications and job roles.
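The key step of the fixed-effects estimator, the within transformation that demeans each individual's observations, can be sketched as follows. This is an illustrative helper, not the study's estimation code.

```python
from collections import defaultdict

def within_transform(panel):
    # panel: list of (individual_id, value) pairs across years.
    # Subtracting each individual's own mean removes time-invariant
    # unobserved heterogeneity, which is the core of the fixed-effects
    # estimator; the same transformation is applied to each regressor.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ind, value in panel:
        sums[ind] += value
        counts[ind] += 1
    means = {ind: sums[ind] / counts[ind] for ind in sums}
    return [(ind, value - means[ind]) for ind, value in panel]
```

Because any time-invariant trait is constant within an individual, it equals its own mean and drops out after demeaning, which is why panel coefficients in the study tend to shrink relative to the cross-sectional ones.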