Browse by author "Antunes, Mário"
Showing 1 - 10 of 40
- Active Manifold Learning with Twitter Big Data
  Publication. Silva, Catarina; Antunes, Mário; Costa, Joana; Ribeiro, Bernardete
  The data produced by Internet applications have increased substantially. Big data is a fast-growing field that deals with this deluge of data by using storage techniques, dedicated infrastructures and development frameworks for the parallelization of defined tasks and their consequent reduction. These solutions, however, fall short in online and highly data-demanding scenarios, since users expect swift feedback. Reduction techniques are used efficiently in big data online applications to improve classification performance. Reduction in big data usually follows one of two main methods: (i) reduce the dimensionality by pruning or reformulating the feature set; (ii) reduce the sample size by choosing the most relevant examples. Both approaches have benefits, not only in the time needed to build a model but potentially also in performance, usually by reducing overfitting and improving generalization capabilities. In this paper we investigate reduction techniques that tackle both the dimensionality and the size of big data. We propose a framework that combines a manifold learning approach to reduce dimensionality and an active learning SVM-based strategy to reduce the size of the labeled sample. Results on Twitter data show the potential of the proposed active manifold learning approach.
- Adaptive learning for dynamic environments: A comparative approach
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Nowadays most learning problems demand adaptive solutions. Current challenges include temporal data streams, drift and non-stationary scenarios, often with text data, whether in social networks or in business systems. Various efforts have been pursued in machine learning settings to learn in such environments, especially because of their non-trivial nature, since changes occur between the data distribution used to define the model and the current environment. In this work we present the Drift Adaptive Retain Knowledge (DARK) framework to tackle adaptive learning in dynamic environments based on recent and retained knowledge. DARK handles an ensemble of multiple Support Vector Machine (SVM) models that are dynamically weighted and have distinct training window sizes. A comparative study with benchmark solutions in the field, namely the Learn++.NSE algorithm, is also presented. Experimental results revealed that DARK outperforms Learn++.NSE with two different base classifiers, an SVM and a Classification and Regression Tree (CART).
- An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
  Publication. Carnaz, Gonçalo; Antunes, Mário; Nogueira, Vitor Beires
  Criminal investigation collects and analyzes the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
- An Artificial Immune System for Temporal Anomaly Detection Using Cell Activation Thresholds and Clonal Size Regulation with Homeostasis
  Publication. Antunes, Mário; Correia, Manuel E.
  This paper presents an Artificial Immune System (AIS) based on Grossman's Tunable Activation Threshold (TAT) for anomaly detection. We describe the immunological metaphor and the algorithm adopted for T-cells, emphasizing two important features: the temporal dynamic adjustment of T-cell clonal size and its associated homeostasis mechanism. We present some promising results obtained with artificially generated data sets, aiming to test the appropriateness of using TAT in dynamically changing environments to distinguish new unseen patterns as part of what should be detected as normal or as anomalous.
- An Automated Repository for the Efficient Management of Complex Documentation
  Publication. Frade, José; Antunes, Mário
  The accelerating digitalization of the public and private sectors has made information technologies (IT) indispensable in modern life. As services shift to digital platforms and technologies expand across industries, the complexity of legal, regulatory, and technical requirement documentation is growing rapidly. This increase presents significant challenges in managing, gathering, and analyzing documents, as their dispersion across various repositories and formats hinders accessibility and efficient processing. This paper presents the development of an automated repository designed to streamline the collection, classification, and analysis of cybersecurity-related documents. By harnessing the capabilities of natural language processing (NLP) models, specifically Generative Pre-Trained Transformer (GPT) technologies, the system automates text ingestion, extraction, and summarization, providing users with visual tools and organized insights into large volumes of data. The repository facilitates the efficient management of evolving cybersecurity documentation, addressing issues of accessibility, complexity, and time constraints. This paper explores the potential applications of NLP in cybersecurity documentation management and highlights the advantages of integrating automated repositories equipped with visualization and search tools. By focusing on legal documents and technical guidelines from Portugal and the European Union (EU), this applied research seeks to enhance cybersecurity governance, streamline document retrieval, and deliver actionable insights to professionals. Ultimately, the goal is to develop a scalable, adaptable platform capable of extending beyond cybersecurity to serve other industries that rely on the effective management of complex documentation.
- An Automated System for Criminal Police Reports Analysis
  Publication. Carnaz, Gonçalo; Nogueira, Vitor B.; Antunes, Mário; Ferreira, Nuno
  Information Extraction (IE) and fusion are complex fields and have been useful in several domains to deal with heterogeneous data sources. Criminal police are challenged in forensic activities with the extraction, processing and interpretation of numerous documents of different types and with distinct formats (templates), such as narrative criminal reports, police databases and the results of OSINT activities, to mention just a few. Such challenges include, among others, coping with and manually connecting hard-to-interpret items, such as license plates, addresses, names, slang and figures of speech. This paper addresses forensic IE and fusion by proposing a system to automatically extract, transform, clean, load and connect police reports arriving from different sources. The same system aims to help police investigators identify and correlate relevant extracted entities.
- Automatic network configuration in virtualized environment using GNS3
  Publication. Emiliano, Rodrigo; Antunes, Mário
  Computer networking is a central topic in the computer science curricula offered by higher education institutions. Network virtualization and simulation tools, like GNS3, allow students and practitioners to test real-world networking scenarios and to build complex network scenarios by configuring virtualized equipment, such as routers and switches, through each device's virtual console. Configuring advanced network topics in GNS3 requires students to perform basic and very repetitive IP configuration tasks on every network device. As the network topology grows, so does the number of network devices to be configured, which may lead to logical configuration errors. In this paper we propose an extension for the GNS3 network virtualizer that automatically generates a valid configuration for all the network devices in a GNS3 scenario. Our implementation is able to automatically produce an initial IP and routing configuration for all the Cisco virtual devices by using the GNS3 specification files. We tested this extension against a set of networked scenarios, which demonstrated its robustness and readiness and the speedup of the overall configuration tasks. In a learning environment, this feature may save time for all networking practitioners, whether beginners or advanced, who aim to configure and test network topologies, since it automatically produces a valid and operational configuration for all the devices designed in a GNS3 environment.
- Benchmarking bioinspired machine learning algorithms with CSE-CIC-IDS2018 network intrusions dataset
  Publication. Ferreira, Paulo; Antunes, Mário
  This paper aims to evaluate the CSE-CIC-IDS2018 network intrusions dataset and benchmark a set of supervised bioinspired machine learning algorithms, namely CLONALG Artificial Immune System, Learning Vector Quantization (LVQ) and Back-Propagation Multi-Layer Perceptron (MLP). The results obtained were also compared with an ensemble strategy based on a majority voting algorithm. The results obtained show the appropriateness of using the dataset to test behaviour-based network intrusion detection algorithms and the efficiency of the MLP algorithm in detecting zero-day attacks, when compared with CLONALG and LVQ.
- Boosting dynamic ensemble’s performance in Twitter
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Many text classification problems in social networks, and other contexts, are also dynamic problems, where concepts drift through time, and meaningful labels are dynamic. In Twitter-based applications in particular, ensembles are often applied to problems that fit this description, for example sentiment analysis or adapting to drifting circumstances. While it can be straightforward to request different classifiers' input on such ensembles, our goal is to boost dynamic ensembles by combining performance metrics as efficiently as possible. We present a twofold performance-based framework to classify incoming tweets based on recent tweets. On the one hand, individual ensemble classifiers' performance is paramount in defining their contribution to the ensemble. On the other hand, examples are actively selected based on their ability to effectively contribute to the performance in classifying drifting concepts. The main step of the algorithm uses different performance metrics to determine both each classifier's strength in the ensemble and each example's importance, and hence lifetime, in the learning process. We demonstrate, on a drifted benchmark dataset, that our framework raises classification performance considerably, enough to make a difference in a variety of applications.
- Choice of Best Samples for Building Ensembles in Dynamic Environments
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Machine learning approaches often focus on optimizing the algorithm rather than assuring that the source data is as rich as possible. However, when it is possible to enhance the input examples to construct models, one should consider it thoroughly. In this work, we propose a technique to define the best set of training examples using dynamic ensembles in text classification scenarios. In dynamic environments, where new data is constantly appearing, old data is usually disregarded, but sometimes some of those disregarded examples may carry substantial information. We propose a method that determines the most relevant examples by analysing their behaviour when defining separating planes or thresholds between classes. Those examples, deemed better than others, are kept for a longer time-window than the rest. Results on a Twitter scenario show that keeping those examples enhances the final classification performance.
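
The idea in the last entry above, keeping the more relevant examples in the training window for longer, can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say how relevance is measured, so the sketch assumes that an example is "relevant" when it lies close to a given linear separating plane. The names `Example`, `margin_distance`, `lifetime` and `training_window`, and the parameters `near`, `short` and `long`, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    features: Sequence[float]
    label: int
    arrived_at: int  # time step (e.g. batch index) at which the example arrived


def margin_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def lifetime(example: Example, w: Sequence[float], b: float,
             near: float = 1.0, short: int = 1, long: int = 3) -> int:
    """Number of time steps an example stays in the training window.

    Assumption: examples within `near` of the separating plane are treated
    as the relevant ones and get the longer lifetime.
    """
    if margin_distance(w, b, example.features) <= near:
        return long
    return short


def training_window(pool: List[Example], now: int,
                    w: Sequence[float], b: float, **kw) -> List[Example]:
    """Keep only the examples whose lifetime has not yet expired at time `now`."""
    return [e for e in pool if now - e.arrived_at < lifetime(e, w, b, **kw)]
```

With a plane `w = (1.0, 0.0)`, `b = 0.0`, an example at `(0.5, 0.0)` that arrived at step 0 would still be in the window at step 2, while one at `(5.0, 0.0)` from the same step would already have been dropped. In practice the plane would come from whatever classifier is being retrained, and the lifetimes would need tuning for the data stream at hand.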
