Search Results
Now showing 1 - 5 of 5
- Choice of Best Samples for Building Ensembles in Dynamic Environments
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Machine learning approaches often focus on optimizing the algorithm rather than ensuring that the source data is as rich as possible. However, when it is possible to enhance the input examples used to construct models, this should be considered thoroughly. In this work, we propose a technique to define the best set of training examples using dynamic ensembles in text classification scenarios. In dynamic environments, where new data is constantly appearing, old data is usually disregarded, but some of those disregarded examples may carry substantial information. We propose a method that determines the most relevant examples by analysing their behaviour when defining separating planes or thresholds between classes. The examples deemed better than the others are kept for a longer time-window than the rest. Results on a Twitter scenario show that keeping those examples enhances the final classification performance.
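The abstract above describes keeping examples that are most informative when defining separating planes or thresholds. A minimal sketch of one common way to realize that idea, assuming a margin-based retention rule (the scoring function, threshold, and margin value here are illustrative assumptions, not the authors' actual method):

```python
# Hypothetical margin-based example retention: examples whose score falls
# near the decision threshold are assumed to carry the most information
# about the class boundary and are kept for a longer time-window.

def select_informative(examples, score_fn, threshold=0.0, margin=0.5):
    """Return examples whose score lies within `margin` of `threshold`."""
    return [x for x in examples
            if abs(score_fn(x) - threshold) <= margin]

# Usage: the scores stand in for a classifier's signed distance to the
# separating plane; only the two borderline examples are retained.
stream = [(-2.0, "old"), (-0.3, "borderline"), (0.1, "borderline"), (1.8, "old")]
kept = select_informative(stream, score_fn=lambda x: x[0])
# kept == [(-0.3, "borderline"), (0.1, "borderline")]
```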
- The impact of longstanding messages in micro-blogging classification
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Social networks are part of the daily routine of millions of users. Twitter is, along with Facebook and Instagram, one of the most used, and can be seen as a relevant source of information, as users share not only daily status updates but also rapidly propagate news and events that occur worldwide. Considering the dynamic nature of social networks and their potential for information spread, it is imperative to find learning strategies able to learn in these environments and cope with their dynamic nature. Time plays an important role, as information quickly becomes outdated; it is crucial to understand how informative past events are to current learning models and for how long it is relevant to store previously seen information, to avoid the computational burden associated with the amount of data produced. In this paper we study the impact of longstanding messages in micro-blogging classification by using different training time-window sizes in the learning process. Since there are few studies dealing with drift in Twitter, and thus little is known about the types of drift that may occur, we simulate different types of drift in an artificial dataset to evaluate and validate our strategy. Results shed light on the relevance of previously seen examples according to different types of drift.
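The training time-window idea above can be sketched with a bounded buffer: only the most recent batches of messages are retained for retraining, so longstanding messages eventually fall out of the training set. This is an illustrative assumption about the general mechanism, not the paper's code:

```python
from collections import deque

class TimeWindowBuffer:
    """Keep only the most recent `window` batches of messages."""

    def __init__(self, window):
        # deque with maxlen silently discards the oldest batch on overflow
        self.batches = deque(maxlen=window)

    def add_batch(self, batch):
        self.batches.append(batch)

    def training_set(self):
        # Flatten the retained batches into one training set
        return [msg for batch in self.batches for msg in batch]

buf = TimeWindowBuffer(window=2)
buf.add_batch(["t1a", "t1b"])
buf.add_batch(["t2a"])
buf.add_batch(["t3a"])  # oldest batch (t1a, t1b) is discarded
# buf.training_set() == ["t2a", "t3a"]
```

Varying `window` is the knob the paper studies: a larger window keeps older messages available to the model at the cost of more computation.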
- Boosting dynamic ensemble's performance in Twitter
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Many text classification problems in social networks, and other contexts, are also dynamic problems, where concepts drift through time and meaningful labels are dynamic. In Twitter-based applications in particular, ensembles are often applied to problems that fit this description, for example sentiment analysis or adapting to drifting circumstances. While it can be straightforward to request different classifiers' input in such ensembles, our goal is to boost dynamic ensembles by combining performance metrics as efficiently as possible. We present a twofold performance-based framework to classify incoming tweets based on recent tweets. On the one hand, individual ensemble classifiers' performance is paramount in defining their contribution to the ensemble. On the other hand, examples are actively selected based on their ability to effectively contribute to the performance in classifying drifting concepts. The main step of the algorithm uses different performance metrics to determine both each classifier's strength in the ensemble and each example's importance, and hence lifetime, in the learning process. We demonstrate, on a drifted benchmark dataset, that our framework considerably improves classification performance, enough to make a difference in a variety of applications.
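The performance-weighted contribution described above can be illustrated with a simple weighted-vote rule, a common way to let each member's measured strength scale its influence. The weights and labels here are assumptions for illustration, not the paper's exact algorithm:

```python
# Performance-weighted ensemble voting sketch: each member votes for a
# class, and votes are weighted by that member's recent accuracy, so
# stronger classifiers dominate the combined decision.

def weighted_vote(predictions, weights):
    """predictions: class labels from each member; weights: their accuracies."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Two weaker members vote "spam", one strong member votes "ham".
label = weighted_vote(["spam", "spam", "ham"], [0.55, 0.50, 0.90])
# label == "spam" because 0.55 + 0.50 > 0.90
```

In a dynamic setting the weights would be re-estimated as each member's accuracy on recent tweets changes, which is what ties the ensemble to the drifting stream.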
- DOTS: Drift Oriented Tool System
  Publication. Antunes, Mário; Costa, Joana; Silva, Catarina; Ribeiro, Bernardete
  Drift is a given in most machine learning applications. The idea that models must accommodate changes, and thus be dynamic, is ubiquitous. Current challenges include temporal data streams, drift and non-stationary scenarios, often with text data, whether in social networks or in business systems. There are multiple types of drift patterns: concepts that appear and disappear suddenly, recurrently, or even gradually or incrementally. Researchers strive to propose and test algorithms and techniques to deal with drift in text classification, but it is difficult to find adequate benchmarks in such dynamic environments. In this paper we present DOTS, the Drift Oriented Tool System, a framework that allows for the definition and generation of text-based datasets where drift characteristics can be thoroughly defined, implemented and tested. The usefulness of DOTS is demonstrated with a Twitter stream case study, in which DOTS is used to define datasets and test the effectiveness of different document representations in a Twitter scenario. Results show the potential of DOTS in machine learning research.
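The drift patterns listed above (sudden, recurring, gradual, incremental) are the kind of behaviour a generator like DOTS encodes into a dataset. A minimal sketch of the simplest case, sudden drift, under assumed names and parameters (this is not DOTS's actual interface):

```python
# Illustrative sudden-drift generator: the active concept is swapped at a
# chosen point in the stream, so a classifier trained before the drift
# point faces a different concept after it.

def sudden_drift_stream(length, drift_point, before="sports", after="politics"):
    """Return the active concept label for each time step."""
    return [before if t < drift_point else after for t in range(length)]

stream = sudden_drift_stream(6, drift_point=3)
# stream == ["sports", "sports", "sports", "politics", "politics", "politics"]
```

Recurring or gradual drift would replace the single switch with a periodic or probabilistic mixing of the two concepts around the drift point.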
- On using crowdsourcing and active learning to improve classification performance
  Publication. Costa, Joana; Silva, Catarina; Antunes, Mário; Ribeiro, Bernardete
  Crowdsourcing is an emergent trend for general-purpose classification problem solving. Over the past decade, this notion has been embodied by enlisting a crowd of humans to help solve problems. A growing number of real-world projects take advantage of this technique, such as Wikipedia, Linux or Amazon Mechanical Turk. In this paper, we evaluate its suitability for classification, namely whether it can outperform state-of-the-art models when combined with active learning techniques. We propose two approaches based on crowdsourcing and active learning and empirically evaluate the performance of a baseline Support Vector Machine when active learning examples are chosen and made available for classification to a crowd in a web-based scenario. The proposed crowdsourcing active learning approach was tested with the Jester data set, a text humour classification benchmark, resulting in promising improvements over baseline results.
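A common way to choose which examples to send to the crowd, as the abstract describes, is uncertainty sampling: the examples the model is least confident about are forwarded to human annotators. The sketch below assumes a generic confidence function and budget; it illustrates the general technique, not the paper's exact pipeline:

```python
# Uncertainty-sampling sketch for crowd labelling: pick the examples with
# the lowest model confidence and route them to human annotators.

def select_for_crowd(examples, confidence_fn, budget):
    """Return the `budget` examples with the lowest model confidence."""
    return sorted(examples, key=confidence_fn)[:budget]

# Usage: confidences stand in for an SVM's calibrated class probabilities.
pool = ["joke-a", "joke-b", "joke-c", "joke-d"]
conf = {"joke-a": 0.95, "joke-b": 0.51, "joke-c": 0.60, "joke-d": 0.88}
queries = select_for_crowd(pool, confidence_fn=conf.get, budget=2)
# queries == ["joke-b", "joke-c"]
```

The crowd's labels for the selected examples would then be added to the training set and the baseline model retrained, closing the active learning loop.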
