Browsing by Author "Ribeiro, Bernardete"
Now showing 1 - 10 of 11
- Adaptive learning for dynamic environments: A comparative approach
  Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Nowadays most learning problems demand adaptive solutions. Current challenges include temporal data streams, drift and non-stationary scenarios, often with text data, whether in social networks or in business systems. Various efforts have been pursued in machine learning settings to learn in such environments, especially because of their non-trivial nature, since changes occur between the data distribution used to define the model and the current environment. In this work we present the Drift Adaptive Retain Knowledge (DARK) framework to tackle adaptive learning in dynamic environments based on recent and retained knowledge. DARK handles an ensemble of multiple Support Vector Machine (SVM) models that are dynamically weighted and have distinct training window sizes. A comparative study with benchmark solutions in the field, namely the Learn++.NSE algorithm, is also presented. Experimental results revealed that DARK outperforms Learn++.NSE with two different base classifiers, an SVM and a Classification and Regression Tree (CART).
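The core idea of the abstract above — an ensemble of SVMs trained on windows of different sizes and weighted dynamically — can be sketched as follows. This is an illustrative toy, not the authors' DARK implementation: the window sizes, the recent-accuracy weighting rule, and the synthetic drifting stream are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stream with a concept drift halfway through.
X = rng.normal(size=(600, 2))
y = (X[:, 0] > 0).astype(int)
y[300:] = (X[300:, 1] > 0).astype(int)  # the concept changes here

window_sizes = [100, 200, 400]          # each member sees a different history
train_X, train_y = X[:500], y[:500]
test_X, test_y = X[500:], y[500:]

members, weights = [], []
for w in window_sizes:
    clf = SVC(kernel="rbf", gamma="scale").fit(train_X[-w:], train_y[-w:])
    # Weight each member by its accuracy on the most recent examples,
    # so members stuck on the old concept lose influence.
    members.append(clf)
    weights.append(clf.score(train_X[-50:], train_y[-50:]))

weights = np.array(weights) / np.sum(weights)

# Weighted-vote prediction over the ensemble.
votes = np.stack([m.predict(test_X) for m in members])  # (n_members, n_test)
ensemble_pred = (weights @ votes > 0.5).astype(int)
print("ensemble accuracy:", np.mean(ensemble_pred == test_y))
```

After the drift, the short-window members adapt quickly and receive larger weights, while the long-window member retains older knowledge, which is the trade-off the framework balances.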
- Boosting dynamic ensemble’s performance in Twitter
  Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Many text classification problems in social networks, and in other contexts, are also dynamic problems, where concepts drift through time and meaningful labels are dynamic. In Twitter-based applications in particular, ensembles are often applied to problems that fit this description, for example sentiment analysis or adapting to drifting circumstances. While it can be straightforward to request different classifiers' input on such ensembles, our goal is to boost dynamic ensembles by combining performance metrics as efficiently as possible. We present a twofold performance-based framework to classify incoming tweets based on recent tweets. On the one hand, the performance of each individual ensemble classifier is paramount in defining its contribution to the ensemble. On the other hand, examples are actively selected based on their ability to contribute effectively to the performance in classifying drifting concepts. The main step of the algorithm uses different performance metrics to determine both each classifier's strength in the ensemble and each example's importance, and hence lifetime, in the learning process. We demonstrate, on a drifted benchmark dataset, that our framework improves classification performance considerably enough to make a difference in a variety of applications.
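The twofold scheme described above — performance-weighted classifiers plus importance-scored examples — can be sketched minimally. The metric choices (F1 for classifier strength, ensemble disagreement for example importance) and the retention size are illustrative assumptions, not the paper's exact rules.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

recent_X, recent_y = X[300:], y[300:]   # stand-in for "recent tweets"

members = [SGDClassifier(random_state=s).fit(X[:300], y[:300])
           for s in range(3)]

# (1) Classifier strength: F1 score on the recent window.
strengths = np.array([f1_score(recent_y, m.predict(recent_X)) for m in members])
strengths = strengths / strengths.sum()

# (2) Example importance: examples the weighted ensemble still gets wrong
# are the most informative, so they earn a longer lifetime in the pool.
votes = np.stack([m.predict(X[:300]) for m in members])
ens = (strengths @ votes > 0.5).astype(int)
importance = (ens != y[:300]).astype(float)
keep = np.argsort(-importance)[:100]    # retain the 100 most informative
print("retained examples:", len(keep))
```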
- Choice of Best Samples for Building Ensembles in Dynamic Environments
  Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Machine learning approaches often focus on optimizing the algorithm rather than ensuring that the source data is as rich as possible. However, when it is possible to enhance the input examples used to construct models, one should consider it thoroughly. In this work, we propose a technique to define the best set of training examples using dynamic ensembles in text classification scenarios. In dynamic environments, where new data is constantly appearing, old data is usually disregarded, but sometimes some of those disregarded examples may carry substantial information. We propose a method that determines the most relevant examples by analysing their behaviour when defining separating planes or thresholds between classes. Those examples, deemed better than others, are kept for a longer time window than the rest. Results in a Twitter scenario show that keeping those examples enhances the final classification performance.
- Distributed Text Classification With an Ensemble Kernel-Based Learning Approach
  Silva, Catarina; Lotric, Uros; Ribeiro, Bernardete; Dobnikar, Andrej
  Constructing a single text classifier that excels in any given application is a rather unattainable goal. As a result, ensemble systems are becoming an important resource, since they permit the use of simpler classifiers and the integration of different knowledge in the learning process. However, many text-classification ensemble approaches have an extremely high computational burden, which poses limitations in applications in real environments. Moreover, state-of-the-art kernel-based classifiers, such as support vector machines and relevance vector machines, demand large resources when applied to large databases. Therefore, we propose a new systematic distributed ensemble framework to tackle these challenges, based on a generic deployment strategy in a cluster distributed environment. We employ a combination of both task and data decomposition of the text-classification system, based on partitioning, communication, agglomeration, and mapping to define and optimize a graph of dependent tasks. Additionally, the framework includes an ensemble system where we exploit diverse patterns of errors and gain from the synergies between the ensemble classifiers. The ensemble data partitioning strategy used is shown to improve the performance of baseline state-of-the-art kernel-based machines. The experimental results show that the proposed framework outperforms standard methods in both speed and classification performance.
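The data-decomposition side of such an ensemble can be sketched as below: the training set is partitioned, one kernel machine is trained per partition (in the paper, on different cluster nodes; here sequentially for simplicity), and predictions are combined by majority vote. The partition count, synthetic data, and combiner are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(900, 10))
y = (X.sum(axis=1) > 0).astype(int)
X_test, y_test = X[800:], y[800:]

# Disjoint data partitions: each member trains on a quarter of the data,
# so each individual fit is much cheaper than one SVM on the full set.
n_parts = 4
parts = np.array_split(rng.permutation(800), n_parts)
members = [SVC(kernel="rbf").fit(X[idx], y[idx]) for idx in parts]

# Combine the partition-trained machines by majority vote.
votes = np.stack([m.predict(X_test) for m in members])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("partitioned-ensemble accuracy:", np.mean(pred == y_test))
```

Since kernel-machine training cost grows super-linearly with the number of examples, training four quarter-sized models is cheaper than one full model, which is the speed gain the decomposition exploits.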
- Enhanced default risk models with SVM+
  Ribeiro, Bernardete; Silva, Catarina; Chen, Ning; Vieira, Armando; Carvalho das Neves, João
  Default risk models have lately attracted great interest due to the recent world economic crisis. Although many advanced techniques have been extensively proposed, no comprehensive method incorporating a holistic perspective has hitherto been considered. Thus, the existing models for bankruptcy prediction lack full coverage of contextual knowledge, which may prevent decision makers such as investors and financial analysts from making the right decisions. The recently proposed SVM+ provides a formal way to incorporate additional information (beyond the training data) into the learning models, improving generalization. In financial settings, examples of such non-financial (though relevant) information are marketing reports, the competitive landscape, the economic environment, customer screening, industry trends, etc. By exploiting additional information able to improve classical inductive learning, we propose a prediction model where data is naturally separated into several structured groups clustered by the size and annual turnover of the firms. Experimental results on a heterogeneous data set of French companies demonstrated that the proposed default risk model shows better predictive performance than the baseline SVM and multi-task learning with SVM.
- Financial distress model prediction using SVM+
  Ribeiro, Bernardete; Silva, Catarina; Vieira, Armando; Gaspar-Cunha, A.; Neves, João C. das
  Financial distress prediction is of great importance to all stakeholders in order to enable better decision-making in evaluating firms. In recent years, the rate of bankruptcy has risen and it is becoming harder to estimate as companies become more complex and the asymmetric information between banks and firms increases. Although a great variety of techniques have been applied over the years, no comprehensive method incorporating a holistic perspective had hitherto been considered. SVM+, a technique recently proposed by Vapnik [17], provides a formal way to incorporate privileged information into the learning models, improving generalization. By exploiting additional information to improve traditional inductive learning, we propose a prediction model where data is naturally separated into several groups according to the size of the firm. Experimental results on a heterogeneous data set of French companies demonstrated that the proposed model showed superior performance in terms of prediction accuracy in bankruptcy prediction and misclassification cost.
- High-performance bankruptcy prediction model using Graphics Processing Units
  Ribeiro, Bernardete; Lopes, Noel; Silva, Catarina
  In recent years the potential and programmability of Graphics Processing Units (GPUs) have raised noteworthy interest in the research community for applications that demand high computational power. In particular, in financial applications containing thousands of high-dimensional samples, machine learning techniques such as neural networks are often used. One of their main limitations is that the learning phase can be extremely time-consuming due to the long training times required, which constitutes a hard bottleneck for their use in practice. Thus their implementation in graphics hardware is highly desirable as a way to speed up the training process. In this paper we present a bankruptcy prediction model based on a parallel implementation of the Multiple BackPropagation (MBP) algorithm, tested on a real data set of French companies (healthy and bankrupt). Results from running the MBP algorithm both in a sequential CPU version and in a parallel GPU implementation show that the latter substantially reduces computational costs while yielding very competitive performance.
- Improving Text Classification Performance with Incremental Background Knowledge
  Silva, Catarina; Ribeiro, Bernardete
  Text classification is generally the process of extracting interesting and non-trivial information and knowledge from text. One of the main problems with text classification systems is the lack of labeled data, as well as the cost of labeling unlabeled data. Thus, there is a growing interest in exploring the use of unlabeled data as a way to improve classification performance in text classification. The ready availability of this kind of data in most applications makes it an appealing source of information. In this work we propose an Incremental Background Knowledge (IBK) technique to introduce unlabeled data into the training set by expanding it, using initial classifiers to deliver oracle decisions. The defined incremental SVM margin-based method was tested on the Reuters-21578 benchmark, showing promising results.
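A margin-based self-training loop in the spirit of the IBK idea above can be sketched as follows: an initial SVM acts as the oracle, and only unlabeled examples it classifies far from the separating plane are added to the training set. The confidence threshold and synthetic data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))
y = (X[:, 0] > 0).astype(int)

X_lab, y_lab = X[:50], y[:50]           # scarce labeled data
X_unl = X[50:]                          # plentiful unlabeled data

# Initial classifier trained on the labeled set only.
clf = SVC(kernel="linear").fit(X_lab, y_lab)

# Oracle decisions: keep only examples far from the margin.
margin = clf.decision_function(X_unl)
confident = np.abs(margin) > 1.0

# Expand the training set with pseudo-labeled confident examples, retrain.
X_aug = np.vstack([X_lab, X_unl[confident]])
y_aug = np.concatenate([y_lab, (margin[confident] > 0).astype(int)])
clf2 = SVC(kernel="linear").fit(X_aug, y_aug)
print("added", int(confident.sum()), "pseudo-labeled examples")
```

Thresholding on the decision-function magnitude is what makes the expansion incremental and margin-based: examples near the plane, where the oracle is least reliable, are deliberately left out.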
- Knowledge Extraction with Non-Negative Matrix Factorization for Text Classification
  Silva, Catarina; Ribeiro, Bernardete
  Text classification has received increasing interest over the past decades for its wide range of applications, driven by the ubiquity of textual information. The high dimensionality of those applications led to pervasive use of dimensionality reduction methods, often black-box non-linear feature extraction techniques. We show how Non-Negative Matrix Factorization (NMF), an algorithm able to learn a parts-based representation of data by imposing non-negativity constraints, can be used to represent and extract knowledge from a text classification problem. The resulting reduced set of features is tested with kernel-based machines on the Reuters-21578 benchmark, showing the method's competitive performance.
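The NMF pipeline described above can be sketched with a toy corpus: documents become non-negative mixtures of a few additive "parts" (topics), and the mixture coefficients form the reduced feature set fed to a downstream classifier. The corpus and component count are toy assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks fell as markets reacted to interest rates",
    "the central bank raised interest rates again",
    "the team won the match in extra time",
    "a late goal decided the football match",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_terms)

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(tfidf)                    # document-topic weights
H = nmf.components_                             # topic-term weights

# Non-negativity makes each topic an additive, interpretable part;
# W is the low-dimensional representation passed to the classifier.
print("reduced representation shape:", W.shape)
```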
- Learning the hash code with generalised regression neural networks for handwritten signature biometric data retrieval
  Ribeiro, Bernardete; Lopes, Noel; Silva, Catarina
  Handwritten signature recognition is one important component of biometric authentication. It is a central process in a broad range of areas requiring personal identification, such as security, legal contracts and bank transactions. Extensive efforts have been put into research on the verification of handwritten signatures, which contain biometric information. Although many successful methods have been proposed, they often disregard the size of databases, which can be very large, posing scalability problems for their application in real-world scenarios. To overcome this problem, in this paper we use binary embeddings of high-dimensional data, which are an efficient tool for indexing big datasets of biometric images. The rationale is to find a good hash function such that data points that are similar in Euclidean space preserve their similarities in the resulting Hamming space, enabling fast data retrieval and state-of-the-art classification performance. In the setting of a handwritten signature retrieval system, an indexing hashing-based scheme is presented. We propose to learn a k-bit hash code with a generalised regression neural network (GRNN), which yielded competitive results on the GPDS database.
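The retrieval side described above — mapping real-valued feature vectors to k-bit binary codes and ranking by Hamming distance — can be sketched as follows. The paper learns the hash with a GRNN; here a random-hyperplane hash (a classic LSH-style stand-in) illustrates the indexing step, and all dimensions and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 32, 64                            # k-bit codes, d-dim signature features
planes = rng.normal(size=(d, k))         # random hash hyperplanes

def hash_code(x):
    # One bit per hyperplane: the sign of the projection.
    return (x @ planes > 0).astype(np.uint8)

database = rng.normal(size=(1000, d))
codes = hash_code(database)              # precomputed binary index

query = database[42] + 0.01 * rng.normal(size=d)   # noisy copy of item 42
q_code = hash_code(query)

# Hamming distance = number of differing bits; the nearest code wins.
dists = np.count_nonzero(codes != q_code, axis=1)
print("retrieved index:", int(np.argmin(dists)))   # should recover item 42
```

Because similar vectors fall on the same side of most hyperplanes, their codes differ in few bits, so nearest-neighbour search reduces to cheap bit comparisons over the compact index.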
