Browsing by Author "Ribeiro, Bernardete"
Now showing 1 - 10 of 14
Results Per Page
Sort Options
- Active Manifold Learning with Twitter Big DataPublication . Silva, Catarina; Antunes, Mário; Costa, Joana; Ribeiro, BernardeteThe data produced by Internet applications have increased substantially. Big data is a flaring field that deals with this deluge of data by using storage techniques, dedicated infrastructures and development frameworks for the parallelization of defined tasks and its consequent reduction. These solutions however fall short in online and highly data demanding scenarios, since users expect swift feedback. Reduction techniques are efficiently used in big data online applications to improve classification problems. Reduction in big data usually falls in one of two main methods: (i) reduce the dimensionality by pruning or reformulating the feature set; (ii) reduce the sample size by choosing the most relevant examples. Both approaches have benefits, not only of time consumed to build a model, but eventually also performance-wise, usually by reducing overfitting and improving generalization capabilities. In this paper we investigate reduction techniques that tackle both dimensionality and size of big data. We propose a framework that combines a manifold learning approach to reduce dimensionality and an active learning SVM-based strategy to reduce the size of labeled sample. Results on Twitter data show the potential of the proposed active manifold learning approach.
- Adaptive learning for dynamic environments: A comparative approachPublication . Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, BernardeteNowadays most learning problems demand adaptive solutions. Current challenges include temporal data streams, drift and non-stationary scenarios, often with text data, whether in social networks or in business systems. Various efforts have been pursued in machine learning settings to learn in such environments, specially because of their non-trivial nature, since changes occur between the distribution data used to define the model and the current environment. In this work we present the Drift Adaptive Retain Knowledge (DARK) framework to tackle adaptive learning in dynamic environments based on recent and retained knowledge. DARK handles an ensemble of multiple Support Vector Machine (SVM) models that are dynamically weighted and have distinct training window sizes. A comparative study with benchmark solutions in the field, namely the Learn++.NSE algorithm, is also presented. Experimental results revealed that DARK outperforms Learn++.NSE with two different base classifiers, an SVM and a Classification and Regression Tree (CART).
- Boosting dynamic ensemble’s performance in TwitterPublication . Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, BernardeteMany text classification problems in social networks, and other contexts, are also dynamic problems, where concepts drift through time, and meaningful labels are dynamic. In Twitter-based applications in particular, ensembles are often applied to problems that fit this description, for example sentiment analysis or adapting to drifting circumstances. While it can be straightforward to request different classifiers' input on such ensembles, our goal is to boost dynamic ensembles by combining performance metrics as efficiently as possible. We present a twofold performance-based framework to classify incoming tweets based on recent tweets. On the one hand, individual ensemble classifiers' performance is paramount in defining their contribution to the ensemble. On the other hand, examples are actively selected based on their ability to effectively contribute to the performance in classifying drifting concepts. The main step of the algorithm uses different performance metrics to determine both each classifier strength in the ensemble and each example importance, and hence lifetime, in the learning process. We demonstrate, on a drifted benchmark dataset, that our framework drives the classification performance considerably up for it to make a difference in a variety of applications.
- Choice of Best Samples for Building Ensembles in Dynamic EnvironmentsPublication . Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, BernardeteMachine learning approaches often focus on optimizing the algorithm rather than assuring that the source data is as rich as possible. However, when it is possible to enhance the input examples to construct models, one should consider it thoroughly. In this work, we propose a technique to define the best set of training examples using dynamic ensembles in text classification scenarios. In dynamic environments, where new data is constantly appearing, old data is usually disregarded, but sometimes some of those disregarded examples may carry substantial information. We propose a method that determines the most relevant examples by analysing their behaviour when defining separating planes or thresholds between classes. Those examples, deemed better than others, are kept for a longer time-window than the rest. Results on a Twitter scenario show that keeping those examples enhances the final classification performance.
- Distributed Text Classification With an Ensemble Kernel-Based Learning ApproachPublication . Silva, Catarina; Lotric, Uros; Ribeiro, Bernardete; Dobnikar, AndrejConstructing a single text classifier that excels in any given application is a rather inviable goal. As a result, ensemble systems are becoming an important resource, since they permit the use of simpler classifiers and the integration of different knowledge in the learning process. However, many text-classification ensemble approaches have an extremely high computational burden, which poses limitations in applications in real environments. Moreover, state-of-the-art kernel-based classifiers, such as support vector machines and relevance vector machines, demand large resources when applied to large databases. Therefore, we propose the use of a new systematic distributed ensemble framework to tackle these challenges, based on a generic deployment strategy in a cluster distributed environment. We employ a combination of both task and data decomposition of the text-classification system, based on partitioning, communication, agglomeration, and mapping to define and optimize a graph of dependent tasks. Additionally, the framework includes an ensemble system where we exploit diverse patterns of errors and gain from the synergies between the ensemble classifiers. The ensemble data partitioning strategy used is shown to improve the performance of baseline state-of-the-art kernel-based machines. The experimental results show that the performance of the proposed framework outperforms standard methods both in speed and classification.
- Enhanced default risk models with SVM+Publication . Ribeiro, Bernardete; Silva, Catarina; Chen, Ning; Vieira, Armando; Carvalho das Neves, JoãoDefault risk models have lately raised a great interest due to the recent world economic crisis. In spite of many advanced techniques that have extensively been proposed, no comprehensive method incorporating a holistic perspective has hitherto been considered. Thus, the existing models for bankruptcy prediction lack the whole coverage of contextual knowledge which may prevent the decision makers such as investors and financial analysts to take the right decisions. Recently, SVM+ provides a formal way to incorporate additional information (not only training data) onto the learning models improving generalization. In financial settings examples of such non-financial (though relevant) information are marketing reports, competitors landscape, economic environment, customers screening, industry trends, etc. By exploiting additional information able to improve classical inductive learning we propose a prediction model where data is naturally separated into several structured groups clustered by the size and annual turnover of the firms. Experimental results in the setting of a heterogeneous data set of French companies demonstrated that the proposed default risk model showed better predictability performance than the baseline SVM and multi-task learning with SVM.
- Financial distress model prediction using SVM+Publication . Ribeiro, Bernardete; Silva, Catarina; Vieira, Armando; Gaspar-Cunha, A.; Neves, João C. dasFinancial distress prediction is of great importance to all stakeholders in order to enable better decision-making in evaluating firms. In recent years, the rate of bankruptcy has risen and it is becoming harder to estimate as companies become more complex and the asymmetric information between banks and firms increases. Although a great variety of techniques have been applied along the years, no comprehensive method incorporating an holistic perspective had hitherto been considered. Recently, SVM+ a technique proposed by Vapnik [17] provides a formal way to incorporate privileged information onto the learning models improving generalization. By exploiting additional information to improve traditional inductive learning we propose a prediction model where data is naturally separated into several groups according to the size of the firm. Experimental results in the setting of a heterogeneous data set of French companies demonstrated that the proposed model showed superior performance in terms of prediction accuracy in bankruptcy prediction and misclassification cost.
- High-performance bankruptcy prediction model using Graphics Processing UnitsPublication . Ribeiro, Bernardete; Lopes, Noel; Silva, CatarinaIn recent years the the potential and programmability of Graphics Processing Units (GPU) has raised a note-worthy interest in the research community for applications that demand high-computational power. In particular, in financial applications containing thousands of high-dimensional samples, machine learning techniques such as neural networks are often used. One of their main limitations is that the learning phase can be extremely consuming due to the long training times required which constitute a hard bottleneck for their use in practice. Thus their implementation in graphics hardware is highly desirable as a way to speed up the training process. In this paper we present a bankruptcy prediction model based on the parallel implementation of the Multiple BackPropagation (MBP) algorithm which is tested on a real data set of French companies (healthy and bankrupt). Results by running the MBP algorithm in a sequential processing CPU version and in a parallel GPU implementation show reduced computational costs with respect to the latter while yielding very competitive performance.
- A Hybrid AIS-SVM Ensemble Approach for Text ClassificationPublication . Antunes, Mário; Silva, Catarina; Ribeiro, Bernardete; Correia, ManuelIn this paper we propose and analyse methods for expanding state-of-the-art performance on text classification. We put forward an ensemble-based structure that includes Support Vector Machines (SVM) and Artificial Immune Systems (AIS). The underpinning idea is that SVM-like approaches can be enhanced with AIS approaches which can capture dynamics in models. While having radically different genesis, and probably because of that, SVM and AIS can cooperate in a committee setting, using a heterogeneous ensemble to improve overall performance, including a confidence on each system classification as the differentiating factor. Results on the well-known Reuters-21578 benchmark are presented, showing promising classification performance gains, resulting in a classification that improves upon all baseline contributors of the ensemble committee.
- Improving recall values in breast cancer diagnosis with Incremental Background KnowledgePublication . Silva, Catarina; Ribeiro, Bernardete; Lopes, NoelCancer diagnosis is generally the process of using some form of physical or genetic tests or exams, usually referred as patient data, to detect the disease. One of the main problems with cancer diagnosis systems is the lack of labeled data, as well as the difficulties of labeling pre-existing unlabeled data. Thus, there is a growing interest in exploring the use of unlabeled data as a way to improve classification performance in cancer diagnosis. The possible availability of this kind of data for some applications makes it an appealing source of information. In this work we explore an Incremental Background Knowledge (IBK) technique to introduce unlabeled data into the training set by expanding it using initial classifiers to better aid decisions, namely by improving recall values. The defined incremental SVM margin-based method was tested in the Wisconsin-Madison breast cancer diagnosis problem to examine the effectiveness of such techniques in supporting diagnosis.
