70 results for Pfahringer, Bernhard, Conference item

  • Random Relational Rules

    Pfahringer, Bernhard; Anderson, Grant (2006)

    Conference item
    University of Waikato

    Exhaustive search in relational learning is generally infeasible, so some form of heuristic search is usually employed, as in FOIL [1]. So-called stochastic discrimination, on the other hand, provides a framework for combining arbitrary numbers of weak classifiers (in this case randomly generated relational rules) in a way where accuracy improves with additional rules, even after maximal accuracy on the training data has been reached [2]. The weak classifiers must have a slightly higher probability of covering instances of their target class than of other classes. As the rules are also independent and identically distributed, the central limit theorem applies, and as the number of weak classifiers/rules grows, coverages for different classes resemble well-separated normal distributions. Stochastic discrimination is closely related to other ensemble methods like bagging, boosting, or random forests, all of which have been tried in relational learning [3, 4, 5].
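
    A minimal propositional sketch of this idea (the paper works with relational rules; plain attribute-threshold tests stand in here, and all function names, thresholds and the binary 0/1 class setup are illustrative assumptions): rules are kept only if they cover the target class slightly above its base rate, and classification averages coverage over all kept rules.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_rule(X):
        # A weak rule: a conjunction of two random attribute-threshold tests.
        f1, f2 = rng.choice(X.shape[1], size=2, replace=False)
        t1 = rng.uniform(X[:, f1].min(), X[:, f1].max())
        t2 = rng.uniform(X[:, f2].min(), X[:, f2].max())
        return lambda A: (A[:, f1] <= t1) & (A[:, f2] > t2)

    def fit(X, y, n_rules=2000, max_tries=100000):
        # y is a 0/1 array; keep rules covering class 1 above its base rate.
        base, rules, tries = y.mean(), [], 0
        while len(rules) < n_rules and tries < max_tries:
            tries += 1
            r = random_rule(X)
            cov = r(X)
            if cov.any() and y[cov].mean() > base:
                rules.append(r)
        return rules

    def score(rules, A):
        # Average coverage over all rules; higher means "more positive".
        return np.mean([r(A) for r in rules], axis=0)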

  • Bagging ensemble selection for regression

    Sun, Quan; Pfahringer, Bernhard (2012)

    Conference item
    University of Waikato

    Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the 'ensemble selection from libraries of models' (ES) strategy. Previous experimental results on binary classification problems have shown that, using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases superior to) other ensemble learning strategies, for instance the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.
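
    A compressed sketch of the library-plus-pruning idea (BES-OOB itself runs ensemble selection inside every bag, using that bag's out-of-bag instances as the hillclimb set; a single validation split stands in here, and all function names are illustrative):

    import numpy as np
    from scipy.optimize import nnls
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    def build_library(X, y, n_models=50):
        # A model library: one regression tree per bootstrap sample.
        library = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X), size=len(X))
            library.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return library

    def nnls_prune(library, X_val, y_val):
        # Non-negative least squares over validation predictions; models
        # that receive zero weight are effectively pruned.
        P = np.column_stack([m.predict(X_val) for m in library])
        w, _ = nnls(P, y_val)
        return w / w.sum()

    def predict(library, w, X):
        # Weighted average of the surviving library members.
        return np.column_stack([m.predict(X) for m in library]) @ w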

  • A discriminative approach to structured biological data

    Mutter, Stefan; Pfahringer, Bernhard (2007)

    Conference item
    University of Waikato

    This paper introduces the first author's PhD project, which has just moved beyond its initial stage. Biological sequence data is, on the one hand, highly structured; on the other hand, there are large amounts of unlabelled data. Thus we combine probabilistic graphical models and semi-supervised learning: the former to handle structured data, the latter to deal with unlabelled data. We apply our models to genotype-phenotype modelling problems. In particular, we predict the set of Single Nucleotide Polymorphisms which underlie a specific phenotypical trait.

  • Learning Distance Metrics for Multi-Label Classification

    Gouk, Henry; Pfahringer, Bernhard; Cree, Michael J. (2016)

    Conference item
    University of Waikato

    Distance metric learning is a well-studied problem in the field of machine learning, where it is typically used to improve the accuracy of instance-based learning techniques. In this paper we propose a distance metric learning algorithm that is specialised for multi-label classification tasks, rather than the multi-class setting considered by most work in this area. The method trains an embedder that can transform instances into a feature space where squared Euclidean distance provides an estimate of the Jaccard distance between the corresponding label vectors. In addition to a linear Mahalanobis-style metric, we also present a nonlinear extension that provides a substantial boost in performance. We show that this technique significantly improves upon current approaches for instance-based multi-label classification, and also enables interesting data visualisations.
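
    A bare-bones sketch of the linear variant (the paper's nonlinear extension is omitted; the SGD loop, learning rate and embedding size are illustrative assumptions): learn a matrix A so that squared Euclidean distance after embedding regresses onto the label-space Jaccard distance.

    import numpy as np

    def jaccard_distance(y1, y2):
        # 1 - |intersection| / |union| for binary label vectors.
        inter = np.logical_and(y1, y2).sum()
        union = np.logical_or(y1, y2).sum()
        return 1.0 - inter / union if union else 0.0

    def fit_linear_metric(X, Y, dim=16, lr=1e-3, steps=10000, seed=0):
        # Stochastic gradient descent on (||A xi - A xj||^2 - jaccard)^2.
        rng = np.random.default_rng(seed)
        A = rng.normal(scale=0.1, size=(dim, X.shape[1]))
        for _ in range(steps):
            i, j = rng.choice(len(X), size=2, replace=False)
            u = X[i] - X[j]
            Au = A @ u
            d = Au @ Au                       # squared embedding distance
            t = jaccard_distance(Y[i], Y[j])  # regression target
            A -= lr * 4.0 * (d - t) * np.outer(Au, u)
        return A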

  • Handling numeric attributes in Hoeffding trees

    Pfahringer, Bernhard; Holmes, Geoffrey; Kirkby, Richard Brendon (2008)

    Conference item
    University of Waikato

    For conventional machine learning classification algorithms, handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into pre-defined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carry over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors, but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approximation approach based on simple Gaussian approximation. We then compare this method with four approaches from the literature, arriving at eight final algorithm configurations for testing. The solutions cover a range of options, from perfectly accurate and memory-intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.
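
    A minimal sketch of the Gaussian approximation (per class and per numeric attribute, only a count, mean and variance are maintained; Welford's update is a standard choice here, and the surrounding class/attribute bookkeeping of the tree is omitted):

    import math

    class GaussianStats:
        # Incremental mean/variance for one attribute within one class.
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def update(self, x):
            self.n += 1
            d = x - self.mean
            self.mean += d / self.n
            self.m2 += d * (x - self.mean)

        def prob_below(self, split):
            # Estimated fraction of this class falling below a split point,
            # from the Gaussian CDF; feeds the usual information-gain test.
            if self.n < 2:
                return 0.5
            sd = math.sqrt(self.m2 / (self.n - 1)) or 1e-9
            return 0.5 * (1.0 + math.erf((split - self.mean) / (sd * math.sqrt(2))))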

  • Mining Arbitrarily Large Datasets Using Heuristic k-Nearest Neighbour Search

    Wu, Xing; Holmes, Geoffrey; Pfahringer, Bernhard (2008)

    Conference item
    University of Waikato

    Nearest Neighbour Search (NNS) is one of the top ten data mining algorithms. It is simple and effective but has a time complexity that is the product of the number of instances and the number of dimensions. When the number of dimensions is greater than two there are no known solutions that can guarantee a sublinear retrieval time. This paper describes and evaluates two ways to make NNS efficient for datasets that are arbitrarily large in the number of instances and dimensions. The methods are best described as heuristic as they are neither exact nor approximate. Both stem from recent developments in the field of data stream classification. The first uses Hoeffding Trees, an extension of decision trees to streams and the second is a direct stream extension of NNS. The methods are evaluated in terms of their accuracy and the time taken to find the neighbours. Results show that the methods are competitive with NNS in terms of accuracy but significantly faster.
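
    One of the two directions, a direct stream extension of NNS, can be caricatured with a fixed-size sliding window (capacity, distance measure and voting choices are illustrative; the paper's actual method differs in detail):

    import numpy as np
    from collections import deque

    class WindowKNN:
        # Constant-memory kNN: only the most recent `capacity` instances
        # are kept, so search cost no longer grows with stream length.
        def __init__(self, k=5, capacity=1000):
            self.k, self.window = k, deque(maxlen=capacity)

        def update(self, x, y):
            self.window.append((x, y))

        def predict(self, x):
            d = [(np.linalg.norm(x - xi), yi) for xi, yi in self.window]
            votes = [yi for _, yi in sorted(d, key=lambda t: t[0])[:self.k]]
            return max(set(votes), key=votes.count)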

  • A semi-supervised spam mail detector

    Pfahringer, Bernhard (2006)

    Conference item
    University of Waikato

    This document describes a novel semi-supervised approach to spam classification, which was successful at the ECML/PKDD 2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semi-supervised learning algorithm.

  • Tie-breaking in Hoeffding trees

    Holmes, Geoffrey; Kirkby, Richard; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    A thorough examination of the performance of Hoeffding trees, the state of the art in classification for data streams, on a range of datasets reveals that tie-breaking, an essential but supposedly rare procedure, is employed much more often than expected. Testing with a lightweight method for handling continuous attributes, we find that the excessive invocation of tie-breaking causes performance to degrade significantly on complex and noisy data. Investigating ways to reduce the number of tie-breaks, we propose an adaptive method that overcomes the problem while not significantly affecting performance on simpler datasets.
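
    For reference, the split decision in question, as commonly stated for Hoeffding trees (tau is the fixed tie-break threshold; the paper's adaptive method replaces it and is not reproduced here):

    import math

    def hoeffding_bound(R, delta, n):
        # Hoeffding epsilon for an estimate with range R from n observations.
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def should_split(g_best, g_second, R, delta, n, tau=0.05):
        # Split when the best attribute is provably better, or when the
        # bound has shrunk below tau: the tie break examined in the paper.
        eps = hoeffding_bound(R, delta, n)
        return (g_best - g_second > eps) or (eps < tau)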

  • Text categorisation using document profiling

    Sauban, Maximilien; Pfahringer, Bernhard (2003)

    Conference item
    University of Waikato

    This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee's model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it is basically just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.
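
    A rough sketch of such a profiling filter (a Naive-Bayes-style scorer that collapses a bag-of-words matrix into one score per class; the smoothing and names are illustrative, and Lee's exact model differs):

    import numpy as np

    def document_profiles(counts, y, alpha=1.0):
        # counts: docs x terms matrix; returns a dense docs x classes
        # "profile" of per-class log-likelihood scores.
        classes = np.unique(y)
        profiles = np.empty((counts.shape[0], len(classes)))
        for j, c in enumerate(classes):
            tc = counts[y == c].sum(axis=0) + alpha  # smoothed term counts
            logp = np.log(tc / tc.sum())             # per-term log probability
            profiles[:, j] = counts @ logp           # NB-style score
        return profiles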

  • Data mining challenge problems: any lessons learned?

    Pfahringer, Bernhard (2002)

    Conference item
    University of Waikato

    When considering the merit of data mining challenges, we need to answer the question of whether the amount of academic outcome justifies the related expense of scarce research time. In this paper I will provide anecdotal evidence for what I have learned personally from participating in various challenges. Building on that, I will suggest a format for challenges that tries to increase the amount and improve the quality of the academic output of such challenges for the machine learning research community.

  • The positive effects of negative information: Extending one-class classification models in binary proteomic sequence classification

    Mutter, Stefan; Pfahringer, Bernhard; Holmes, Geoffrey (2009)

    Conference item
    University of Waikato

    Profile Hidden Markov Models (PHMMs) have been widely used as models for Multiple Sequence Alignments. By their nature, they are generative one-class classifiers, trained only on sequences belonging to the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from non-target classes in discriminative tasks. Firstly, the traditional PHMM is extended to a new binary classifier. Secondly, we propose propositional representations of the original PHMM that capture information from target and non-target sequences and can be used with standard binary classifiers. Since PHMM training is time-intensive, we investigate whether our approach allows the training of the PHMM to stop before it is fully converged, without loss of predictive power.
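
    A heavily simplified stand-in for the binary extension (a position-specific scoring model over a fixed alignment replaces the full PHMM; the alphabet, smoothing and equal-length input are all assumptions):

    import numpy as np

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # amino acids (assumed alphabet)

    def column_logprobs(seqs, alpha=1.0):
        # Per-position emission log-probabilities from aligned,
        # equal-length sequences.
        L = len(seqs[0])
        logp = np.zeros((L, len(ALPHABET)))
        for i in range(L):
            counts = np.array([sum(s[i] == a for s in seqs)
                               for a in ALPHABET], dtype=float) + alpha
            logp[i] = np.log(counts / counts.sum())
        return logp

    def log_odds(seq, target_lp, other_lp):
        # Positive favours the target class, negative the non-target one:
        # this is where the non-target information enters the decision.
        idx = [ALPHABET.index(c) for c in seq]
        pos = np.arange(len(seq))
        return float(target_lp[pos, idx].sum() - other_lp[pos, idx].sum())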

  • MOA: Massive Online Analysis, a framework for stream classification and clustering.

    Bifet, Albert; Holmes, Geoffrey; Pfahringer, Bernhard; Kranen, Philipp; Kremer, Hardy; Jansen, Timm; Seidl, Thomas (2010)

    Conference item
    University of Waikato

    Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problem of scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes. It contains a collection of offline and online algorithms for both classification and clustering, as well as tools for evaluation. In particular, for classification it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. For clustering, it implements StreamKM++, CluStream, ClusTree, Den-Stream, D-Stream and CobWeb. Researchers benefit from MOA by getting insights into the workings and problems of different approaches; practitioners can easily apply and compare several algorithms on real-world datasets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.
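
    MOA itself is a Java framework; purely to illustrate the evaluation style it supports, here is a test-then-train (prequential) loop sketched in Python, where the model's learn/predict interface is an assumption, not MOA's API:

    def prequential(stream, model):
        # Test-then-train: every instance is first used for evaluation,
        # then for learning, so no separate holdout set is needed.
        correct = total = 0
        for x, y in stream:
            if total:  # nothing to test against before the first update
                correct += (model.predict(x) == y)
            model.learn(x, y)  # assumed incremental-learner interface
            total += 1
        return correct / max(total - 1, 1)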

  • Clustering for classification

    Evans, Reuben James Emmanuel; Pfahringer, Bernhard; Holmes, Geoffrey (2011)

    Conference item
    University of Waikato

    Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection means that there are now many large datasets being generated. To find patterns in these datasets it would be useful to be able to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive, in fact quadratic in the number of data points, so they cannot be applied directly. This paper proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, reduce them to a smaller but still representative dataset so that advanced methods can be applied. It compares the results of using this framework against using random selection on a large number of classification problems. Results show that clustering prior to classification is beneficial when employing a sophisticated classifier; however, when the classifier is simple, the benefits over random selection do not justify the added cost of clustering. The results also show that for each dataset it is important to choose a clustering method carefully.
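
    A minimal instantiation of the framework (per-class k-means as the summariser and an SVM as the sophisticated classifier are one choice among the several the paper compares; cluster counts are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def summarise_then_classify(X, y, n_clusters=200, seed=0):
        # Summarise each class by k-means centroids, keeping labels clean,
        # then train the expensive classifier on the small summary only.
        Xs, ys = [], []
        for c in np.unique(y):
            Xc = X[y == c]
            k = min(n_clusters, len(Xc))
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
            Xs.append(km.cluster_centers_)
            ys.append(np.full(k, c))
        return SVC().fit(np.vstack(Xs), np.concatenate(ys))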

  • Prediction of ordinal classes using regression trees

    Kramer, Stefan; Widmer, Gerhard; Pfahringer, Bernhard; de Groeve, Michael (2000)

    Conference item
    University of Waikato

    This paper is devoted to the problem of learning to predict ordinal (i.e., ordered discrete) classes using classification and regression trees. We start with S-CART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the trade-off between optimal categorical classification accuracy (hit rate) and minimum distance-based error. Preliminary results indicate that this is a promising avenue towards algorithms that combine aspects of classification and regression.
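
    The simplest of the strategies studied can be sketched as regression on class ranks (S-CART is a relational learner; a standard regression tree stands in here, and the rank encoding and rounding details are assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_ordinal(X, ranks):
        # Train on integer ranks 0..n_classes-1 encoding the class order.
        return DecisionTreeRegressor(min_samples_leaf=5).fit(X, ranks)

    def predict_ordinal(tree, X, n_classes):
        # Round back to the nearest valid rank, so mistakes between
        # neighbouring classes stay cheap under a distance-based error.
        raw = np.rint(tree.predict(X)).astype(int)
        return np.clip(raw, 0, n_classes - 1)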

  • Learning to use operational advice

    Fürnkranz, Johannes; Pfahringer, Bernhard; Kaindl, Hermann; Kramer, Stefan (2000)

    Conference item
    University of Waikato

    We address the problem of advice-taking in a given domain, in particular for building a game-playing program. Our approach to solving it strives for the application of machine learning techniques throughout, i.e. for avoiding knowledge elicitation by any other means as much as possible. In particular, we build upon existing work on the operationalization of advice by machine and assume that advice is already available in operational form. The relative importance of this advice is, however, not yet known and can therefore not be utilized well by a program. This paper presents an approach to determine the relative importance for a given situation through reinforcement learning. We implemented this approach for the game of Hearts and gathered some empirical evidence on its usefulness through experiments. The results show that the programs built according to our approach learned to make good use of the given operational advice.
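
    One plausible reading of "learning the relative importance through reinforcement learning" is a gradient-bandit weighting over pieces of advice (an illustrative stand-in, not the paper's Hearts implementation; all names and constants are assumptions):

    import numpy as np

    class AdviceWeights:
        # Softmax preferences over pieces of advice; preferences rise for
        # advice whose chosen moves earn above-baseline reward.
        def __init__(self, n_advice, lr=0.1, seed=0):
            self.h = np.zeros(n_advice)
            self.lr, self.baseline = lr, 0.0
            self.rng = np.random.default_rng(seed)

        def choose(self, applicable):
            # Sample among the advice applicable in the current situation.
            z = self.h[applicable] - self.h[applicable].max()
            p = np.exp(z); p /= p.sum()
            return self.rng.choice(applicable, p=p)

        def update(self, chosen, reward):
            self.baseline += 0.01 * (reward - self.baseline)
            self.h[chosen] += self.lr * (reward - self.baseline)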

  • Exploiting propositionalization based on random relational rules for semi-supervised learning

    Pfahringer, Bernhard; Anderson, Grant (2008)

    Conference item
    University of Waikato

    In this paper we investigate an approach to semi-supervised learning based on randomized propositionalization, which allows for applying standard propositional classification algorithms like support vector machines to multi-relational data. Randomization based on random relational rules can work both with and without a class attribute and can therefore be applied simultaneously to both the labeled and the unlabeled portion of the data present in semi-supervised learning. An empirical investigation compares semi-supervised propositionalization to standard propositionalization using just the labeled data portion, as well as to a variant that also just uses the labeled data portion but includes the label information in an attempt to improve the resulting propositionalization. Preliminary experimental results indicate that propositionalization generated on the full dataset, i.e. the semi-supervised approach, tends to outperform the other two more standard approaches.
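
    The core trick in propositional miniature (single-attribute threshold rules stand in for relational rules; rule shape and counts are illustrative): rule generation sees all the data, the classifier sees only the labeled part.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    def random_rules(X_all, n_rules=500):
        # Built on labeled AND unlabeled rows: coverage needs no class.
        rules = []
        for _ in range(n_rules):
            f = rng.integers(X_all.shape[1])
            t = rng.uniform(X_all[:, f].min(), X_all[:, f].max())
            rules.append((f, t))
        return rules

    def propositionalise(rules, X):
        # One boolean feature per rule: is the instance covered?
        return np.column_stack([X[:, f] <= t for f, t in rules]).astype(float)

    # Usage: rules = random_rules(np.vstack([X_lab, X_unlab]))
    #        clf = SVC().fit(propositionalise(rules, X_lab), y_lab)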

  • Improving on bagging with input smearing

    Frank, Eibe; Pfahringer, Bernhard (2006)

    Conference item
    University of Waikato

    Bagging is an ensemble learning method that has proved to be a useful tool in the arsenal of machine learning practitioners. Commonly applied in conjunction with decision tree learners to build an ensemble of decision trees, it often leads to reduced errors in the predictions when compared to using a single tree. A single tree is built from a training set of size N. Bagging is based on the idea that, ideally, we would like to eliminate the variance due to a particular training set by combining trees built from all training sets of size N. However, in practice, only one training set is available, and bagging simulates this platonic method by sampling with replacement from the original training data to form new training sets. In this paper we pursue the idea of sampling from a kernel density estimator of the underlying distribution to form new training sets, in addition to sampling from the data itself. This can be viewed as “smearing out” the resampled training data to generate new datasets, and the amount of “smear” is controlled by a parameter. We show that the resulting method, called “input smearing”, can lead to improved results when compared to bagging. We present results for both classification and regression problems.
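
    A compact sketch of input smearing for classification (the smear parameter and per-attribute scaling follow the description above; integer class labels and the tree count are assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def smeared_ensemble(X, y, n_trees=50, smear=0.3, seed=0):
        # Bagging, but each bootstrap sample is perturbed by Gaussian
        # noise scaled per attribute -- sampling from a kernel density
        # estimate of the data rather than from the data alone.
        rng = np.random.default_rng(seed)
        scale = smear * X.std(axis=0)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), size=len(X))
            Xb = X[idx] + rng.normal(size=X.shape) * scale
            trees.append(DecisionTreeClassifier().fit(Xb, y[idx]))
        return trees

    def predict(trees, X):
        # Majority vote over the smeared trees (integer labels assumed).
        votes = np.stack([t.predict(X) for t in trees]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)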

  • Scaling up semi-supervised learning: An efficient and effective LLGC variant

    Pfahringer, Bernhard; Leschi, Claire; Reutemann, Peter (2007)

    Conference item
    University of Waikato

    Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data, and most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
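
    The shape of the sparsified iteration (kNN sparsification as described above; the Gaussian kernel, alpha and the iteration count are illustrative defaults):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.neighbors import kneighbors_graph

    def sparse_llgc(X, Y, alpha=0.99, k=10, iters=50):
        # Y: one-hot rows for labeled points, zero rows for unlabeled.
        # kNN sparsification keeps O(nk) similarities instead of O(n^2).
        W = kneighbors_graph(X, n_neighbors=k, mode='distance')
        W = W.maximum(W.T)             # symmetrise the graph
        W.data = np.exp(-W.data ** 2)  # distances -> similarities
        d = np.asarray(W.sum(axis=1)).ravel()
        Dinv = sp.diags(1.0 / np.sqrt(d + 1e-12))
        S = Dinv @ W @ Dinv            # normalised similarity matrix
        F = Y.astype(float)
        for _ in range(iters):         # F <- alpha*S*F + (1-alpha)*Y
            F = alpha * (S @ F) + (1 - alpha) * Y
        return F.argmax(axis=1)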

  • A novel two stage scheme utilizing the test set for model selection in text classification

    Pfahringer, Bernhard; Reutemann, Peter; Mayo, Michael (2005)

    Conference item
    University of Waikato

    Text classification is a natural application domain for semi-supervised learning, as labeling documents is expensive, but on the other hand an abundance of unlabeled documents is usually available. We describe a novel, simple two-stage scheme based on dagging which allows for utilizing the test set in model selection. The dagging ensemble can also be used by itself instead of the original classifier. We evaluate the performance of a meta-classifier choosing between various base learners and their respective dagging ensembles. The selection process seems to perform robustly, especially for small percentages of available labels for training.
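
    Dagging itself in brief (disjoint folds instead of bagging's overlapping bootstrap samples; the fold count and integer labels are assumptions):

    import numpy as np
    from sklearn.base import clone

    def dagging(base, X, y, n_folds=10, seed=0):
        # One copy of the base learner per disjoint fold of the data.
        order = np.random.default_rng(seed).permutation(len(X))
        return [clone(base).fit(X[f], y[f])
                for f in np.array_split(order, n_folds)]

    def vote(models, X):
        # Majority vote of the fold models.
        preds = np.stack([m.predict(X) for m in models]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)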

  • Millions of random rules

    Pfahringer, Bernhard; Holmes, Geoffrey; Wang, Cheng (2004)

    Conference item
    University of Waikato

    In this paper we report on work in progress based on the induction of vast numbers of almost random rules. This work tries to combine and explore ideas from both Random Forests and Stochastic Discrimination. We describe a fast algorithm for generating almost random rules and study its performance. Rules are generated in such a way that all training examples are covered by roughly the same number of rules each. Rules themselves usually have a clear majority class among the examples they cover, but they are constrained by neither minimum coverage nor minimum purity. A preliminary experimental evaluation indicates promising results for both predictive accuracy and speed of induction, but at the expense of large memory consumption and slow prediction. Finally, we discuss various directions for our future research.
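
    The coverage-balancing idea in miniature (interval rules on one attribute stand in for the paper's richer rules; widths and counts are illustrative): each new rule is seeded on a currently least-covered example.

    import numpy as np

    rng = np.random.default_rng(0)

    def balanced_random_rules(X, n_rules=10000):
        # Keep per-example coverage roughly uniform by growing each rule
        # around one of the least-covered training examples so far.
        coverage = np.zeros(len(X), dtype=int)
        rules = []
        for _ in range(n_rules):
            seed = int(np.argmin(coverage))  # least-covered example
            f = rng.integers(X.shape[1])
            width = rng.uniform(0.05, 0.5) * (X[:, f].max() - X[:, f].min())
            lo, hi = X[seed, f] - width, X[seed, f] + width
            cov = (X[:, f] >= lo) & (X[:, f] <= hi)
            coverage += cov
            rules.append((f, lo, hi))
        return rules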
