73 results for Pfahringer, Bernhard, Conference item

  • Random Relational Rules

    Pfahringer, Bernhard; Anderson, Grant (2006)

    Conference item
    University of Waikato

    Exhaustive search in relational learning is generally infeasible; therefore some form of heuristic search is usually employed, such as in FOIL [1]. On the other hand, so-called stochastic discrimination provides a framework for combining arbitrary numbers of weak classifiers (in this case randomly generated relational rules) in a way where accuracy improves with additional rules, even after maximal accuracy on the training data has been reached [2]. The weak classifiers must have a slightly higher probability of covering instances of their target class than instances of other classes. As the rules are also independent and identically distributed, the central limit theorem applies and, as the number of weak classifiers/rules grows, coverages for different classes resemble well-separated normal distributions. Stochastic discrimination is closely related to other ensemble methods like Bagging, Boosting, or Random Forests, all of which have been tried in relational learning [3, 4, 5].

    View record details
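
    A hedged sketch of the stochastic-discrimination voting scheme outlined in the record above. It is illustrative only: random axis-parallel rules on propositional data stand in for the randomly generated relational rules, and all names are my own.

    ```python
    # Illustrative sketch, not the authors' algorithm: random rules with a slight
    # target-class bias, combined by simple coverage counting.
    import numpy as np

    rng = np.random.default_rng(0)

    def random_rule(X):
        """A random conjunction of two threshold tests on randomly chosen attributes."""
        atts = rng.choice(X.shape[1], size=2, replace=False)
        thrs = [rng.uniform(X[:, a].min(), X[:, a].max()) for a in atts]
        signs = rng.choice([-1.0, 1.0], size=2)
        return lambda x: all(s * (x[a] - t) >= 0 for a, t, s in zip(atts, thrs, signs))

    def build_rules(X, y, target, n_rules=500):
        """Keep only rules that cover the target class slightly more often than the rest."""
        kept = []
        while len(kept) < n_rules:
            rule = random_rule(X)
            covered = np.array([rule(x) for x in X])
            if covered.any() and covered[y == target].mean() > covered[y != target].mean():
                kept.append(rule)
        return kept

    def coverage_score(rules, x):
        """Fraction of rules covering x; per class, these scores concentrate around
        separated means as the number of rules grows (central limit theorem)."""
        return np.mean([rule(x) for rule in rules])
    ```

    Prediction would then build one rule set per class and assign a test instance to the class whose rule set gives it the highest coverage score.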
  • Bagging ensemble selection for regression

    Sun, Quan; Pfahringer, Bernhard (2012)

    Conference item
    University of Waikato

    Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.

    View record details
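
    A hedged sketch of the core building block described in the record above: greedy forward ensemble selection (with replacement) on out-of-bag predictions. The bagging layer of BES-OOB would repeat this over bootstrap bags and average the resulting weightings; the function and variable names are my assumptions, not the authors' code.

    ```python
    # Illustrative sketch only: greedy ensemble selection for regression on a
    # matrix of out-of-bag predictions.
    import numpy as np

    def greedy_ensemble_selection(oob_preds, y, n_steps=25):
        """oob_preds: (n_models, n_instances) predictions on out-of-bag data.
        Returns normalized selection counts, usable as model weights."""
        counts = np.zeros(oob_preds.shape[0])
        running_sum = np.zeros_like(y, dtype=float)
        for step in range(1, n_steps + 1):
            # squared error of the running average if each candidate were added next
            errors = [np.mean(((running_sum + p) / step - y) ** 2) for p in oob_preds]
            best = int(np.argmin(errors))
            running_sum += oob_preds[best]
            counts[best] += 1
        return counts / counts.sum()
    ```

    The non-negative least squares pruning mentioned at the end of the abstract could replace the greedy step by solving directly for non-negative model weights (e.g. with scipy.optimize.nnls).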
  • Learning Distance Metrics for Multi-Label Classification

    Gouk, Henry; Pfahringer, Bernhard; Cree, Michael J. (2016)

    Conference item
    University of Waikato

    Distance metric learning is a well-studied problem in the field of machine learning, where it is typically used to improve the accuracy of instance-based learning techniques. In this paper we propose a distance metric learning algorithm that is specialised for multi-label classification tasks, rather than the multiclass setting considered by most work in this area. The method trains an embedder that can transform instances into a feature space where squared Euclidean distance provides an estimate of the Jaccard distance between the corresponding label vectors. In addition to a linear Mahalanobis-style metric, we also present a nonlinear extension that provides a substantial boost in performance. We show that this technique significantly improves upon current approaches for instance-based multi-label classification, and also enables interesting data visualisations.

    View record details
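
    A rough sketch of the linear variant as I read the abstract above: learn a projection so that squared Euclidean distance between embedded instances approximates the Jaccard distance between their label vectors. The training loop and all names are assumptions for illustration, not the authors' method.

    ```python
    # Illustrative sketch only: pairwise regression of squared embedded distance
    # onto label-vector Jaccard distance, with a linear projection L.
    import numpy as np

    def jaccard_distance(a, b):
        """1 - |a AND b| / |a OR b| for binary label vectors."""
        union = np.sum(a | b)
        return 1.0 - np.sum(a & b) / union if union > 0 else 0.0

    def fit_linear_embedding(X, Y, dim=16, lr=1e-3, steps=5000, seed=0):
        rng = np.random.default_rng(seed)
        L = rng.normal(scale=0.1, size=(X.shape[1], dim))
        Yb = Y.astype(bool)
        for _ in range(steps):
            i, j = rng.integers(len(X), size=2)      # a random pair of instances
            u = X[i] - X[j]
            diff = u @ L                             # embedded difference
            err = diff @ diff - jaccard_distance(Yb[i], Yb[j])
            L -= lr * 4.0 * err * np.outer(u, diff)  # gradient of err**2 w.r.t. L
        return L
    ```

    The nonlinear extension mentioned in the abstract would presumably swap the single matrix L for a nonlinear embedder trained on a similar pairwise objective.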
  • Experiments in Predicting Biodegradability

    Blockeel, Hendrik; Džeroski, Sašo; Kompare, Boris; Kramer, Stefan; Pfahringer, Bernhard; Van Laer, Wim (2004)

    Conference item
    University of Waikato

    This paper is concerned with the use of AI techniques in ecology. More specifically, we present a novel application of inductive logic programming (ILP) in the area of quantitative structure-activity relationships (QSARs). The activity we want to predict is the biodegradability of chemical compounds in water. In particular, the target variable is the half-life for aerobic aqueous biodegradation. Structural descriptions of chemicals in terms of atoms and bonds are derived from the chemicals' SMILES encodings. The definition of substructures is used as background knowledge. Predicting biodegradability is essentially a regression problem, but we also consider a discretized version of the target variable. We thus employ a number of relational classification and regression methods on the relational representation and compare these to propositional methods applied to different propositionalizations of the problem. We also experiment with a prediction technique that consists of merging upper and lower bound predictions into one prediction. Some conclusions are drawn concerning the applicability of machine learning systems and the merging technique in this domain and the evaluation of hypotheses.

    View record details
  • A semi-supervised spam mail detector

    Pfahringer, Bernhard (2006)

    Conference item
    University of Waikato

    This document describes a novel semi-supervised approach to spam classification, which was successful at the ECML/PKDD 2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semi-supervised learning algorithm.

    View record details
  • Tie-breaking in Hoeffding trees

    Holmes, Geoffrey; Kirkby, Richard; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    A thorough examination of the performance of Hoeffding trees, state-of-the-art in classification for data streams, on a range of datasets reveals that tie breaking, an essential but supposedly rare procedure, is employed much more than expected. Testing with a lightweight method for handling continuous attributes, we find that the excessive invocation of tie breaking causes performance to degrade significantly on complex and noisy data. Investigating ways to reduce the number of tie breaks, we propose an adaptive method that overcomes the problem while not significantly affecting performance on simpler datasets.

    View record details
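
    For reference, a minimal sketch of the standard Hoeffding-bound split test with the usual tie-breaking threshold, the mechanism whose over-use the record above investigates. Parameter names are generic, not taken from the paper.

    ```python
    # Standard Hoeffding-tree split decision with tie breaking (illustrative sketch).
    import math

    def hoeffding_bound(value_range, delta, n):
        """Epsilon such that the true mean lies within epsilon of the observed mean
        with probability 1 - delta after n observations."""
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
        eps = hoeffding_bound(value_range, delta, n)
        if best_gain - second_gain > eps:
            return True             # the best attribute is confidently better
        return eps < tie_threshold  # tie break: candidates too close to separate
    ```

    The tie-break branch fires whenever epsilon shrinks below the threshold before any attribute clearly wins; the paper's finding is that on complex, noisy streams this happens far more often than expected, which motivates its adaptive alternative.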
  • Text categorisation using document profiling

    Sauban, Maximilien; Pfahringer, Bernhard (2003)

    Conference item
    University of Waikato

    This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee's model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it is basically just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.

    View record details
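
    A hedged sketch of one way to realise the "document profile" idea described above: use per-class Naive Bayes log-likelihood scores as a dense, low-dimensional representation that is handed to an arbitrary propositional learner. This is my interpretation for illustration, not Lee's model or the authors' filter.

    ```python
    # Illustrative sketch only: a Naive-Bayes-style profiling filter.
    import numpy as np

    def class_word_log_probs(X_counts, y, n_classes, alpha=1.0):
        """Laplace-smoothed multinomial word log-probabilities, one row per class."""
        log_probs = np.zeros((n_classes, X_counts.shape[1]))
        for c in range(n_classes):
            counts = X_counts[y == c].sum(axis=0) + alpha
            log_probs[c] = np.log(counts / counts.sum())
        return log_probs

    def profile(doc_counts, log_probs):
        """Dense profile: one log-likelihood score per class for this document."""
        return log_probs @ doc_counts
    ```

    Each document thus collapses from thousands of term counts to a handful of per-class scores, which is what makes the subsequent learning algorithm cheap.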
  • Data mining challenge problems: any lessons learned?

    Pfahringer, Bernhard (2002)

    Conference item
    University of Waikato

    When considering the merit of data mining challenges, we need to answer the question of whether the amount of academic outcome justifies the related expense of scarce research time. In this paper I will provide anecdotal evidence of what I have learned personally from participating in various challenges. Building on that, I will suggest a format for challenges that tries to increase the amount and improve the quality of the academic output of such challenges for the machine learning research community.

    View record details
  • The positive effects of negative information: Extending one-class classification models in binary proteomic sequence classification

    Mutter, Stefan; Pfahringer, Bernhard; Holmes, Geoffrey (2009)

    Conference item
    University of Waikato

    Profile Hidden Markov Models (PHMMs) have been widely used as models for Multiple Sequence Alignments. By their nature, they are generative one-class classifiers trained only on sequences belonging to the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from non-target classes in discriminative tasks. Firstly, the traditional PHMM is extended to a new binary classifier. Secondly, we propose propositional representations of the original PHMM that capture information from target and non-target sequences and can be used with standard binary classifiers. Since PHMM training is time intensive, we investigate whether our approach allows the training of the PHMM to stop, before it is fully converged, without loss of predictive power.

    View record details
  • Prediction of ordinal classes using regression trees

    Kramer, Stefan; Widmer, Gerhard; Pfahringer, Bernhard; de Groeve, Michael (2000)

    Conference item
    University of Waikato

    This paper is devoted to the problem of learning to predict ordinal (i.e., ordered discrete) classes using classification and regression trees. We start with S-CART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the trade-off between optimal categorical classification accuracy (hit rate) and minimum distance-based error. Preliminary results indicate that this is a promising avenue towards algorithms that combine aspects of classification and regression.

    View record details
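
    One simple variant of the transformation studied in the record above, sketched with a scikit-learn regression tree as a stand-in for S-CART (an illustrative assumption, not the paper's algorithm): treat the ordered class indices as numeric targets, fit a regression tree, and round predictions back onto the class scale, reporting both hit rate and distance-based error.

    ```python
    # Illustrative sketch only: ordinal classification via a regression tree.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def ordinal_via_regression(X_train, y_train, X_test, y_test, n_classes):
        reg = DecisionTreeRegressor().fit(X_train, y_train.astype(float))
        pred = np.clip(np.rint(reg.predict(X_test)), 0, n_classes - 1).astype(int)
        hit_rate = np.mean(pred == y_test)              # categorical accuracy
        mean_abs_dist = np.mean(np.abs(pred - y_test))  # distance-based error
        return hit_rate, mean_abs_dist
    ```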
  • Learning to use operational advice

    Fürnkranz, Johannes; Pfahringer, Bernhard; Kaindl, Hermann; Kramer, Stefan (2000)

    Conference item
    University of Waikato

    We address the problem of advice-taking in a given domain, in particular for building a game-playing program. Our approach to solving it strives for the application of machine learning techniques throughout, i.e. for avoiding knowledge elicitation by any other means as much as possible. In particular, we build upon existing work on the operationalization of advice by machine and assume that advice is already available in operational form. The relative importance of this advice is, however, not yet known and can therefore not be utilized well by a program. This paper presents an approach to determine the relative importance for a given situation through reinforcement learning. We implemented this approach for the game of Hearts and gathered some empirical evidence on its usefulness through experiments. The results show that the programs built according to our approach learned to make good use of the given operational advice.

    View record details
  • Exploiting propositionalization based on random relational rules for semi-supervised learning

    Pfahringer, Bernhard; Anderson, Grant (2008)

    Conference item
    University of Waikato

    In this paper we investigate an approach to semi-supervised learning based on randomized propositionalization, which allows for applying standard propositional classification algorithms like support vector machines to multi-relational data. Randomization based on random relational rules can work both with and without a class attribute and can therefore be applied simultaneously to both the labeled and the unlabeled portion of the data present in semi-supervised learning. An empirical investigation compares semi-supervised propositionalization to standard propositionalization using just the labeled data portion, as well as to a variant that also uses just the labeled data portion but includes the label information in an attempt to improve the resulting propositionalization. Preliminary experimental results indicate that the propositionalization generated on the full dataset, i.e. the semi-supervised approach, tends to outperform the other two more standard approaches.

    View record details
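
    A minimal sketch of the propositionalization step described above, under my own assumptions: each random relational rule becomes one boolean attribute, and since rule generation needs no class labels it can run over labeled and unlabeled instances together before a standard learner such as an SVM is trained on the labeled rows.

    ```python
    # Illustrative sketch only: rules become columns of a boolean feature matrix.
    import numpy as np

    def propositionalize(instances, rules):
        """Column j of the result indicates whether rule j covers the instance."""
        return np.array([[int(rule(x)) for rule in rules] for x in instances])

    # Usage idea: F = propositionalize(labeled + unlabeled, rules), then train an
    # SVM on F[:len(labeled)] with the known labels.
    ```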
  • Scaling up semi-supervised learning: An efficient and effective LLGC variant

    Pfahringer, Bernhard; Leschi, Claire; Reutemann, Peter (2007)

    Conference item
    University of Waikato

    Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.

    View record details
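
    A hedged sketch of the standard LLGC label-propagation iteration with a sparsified (k-nearest-neighbour) similarity matrix, reflecting the sparsification modification described above. The priming step is omitted, parameter choices are my own, and none of this is the authors' implementation.

    ```python
    # Illustrative sketch only: LLGC iteration F <- alpha*S*F + (1-alpha)*Y with a
    # sparse k-NN similarity matrix instead of a dense one.
    import numpy as np
    from scipy.sparse import diags
    from sklearn.neighbors import kneighbors_graph

    def llgc_sparse(X, Y, alpha=0.99, k=10, sigma=1.0, iters=50):
        """X: (n, d) features; Y: (n, c) one-hot labels with all-zero rows for unlabeled."""
        W = kneighbors_graph(X, k, mode='distance', include_self=False)
        W.data = np.exp(-(W.data ** 2) / (2 * sigma ** 2))  # Gaussian weights on k-NN edges
        W = 0.5 * (W + W.T)                                  # symmetrize
        d = np.asarray(W.sum(axis=1)).ravel()
        D = diags(1.0 / np.sqrt(d + 1e-12))
        S = D @ W @ D                                        # normalized affinity matrix
        F = Y.astype(float)
        for _ in range(iters):
            F = alpha * (S @ F) + (1 - alpha) * Y            # propagate labels
        return np.asarray(F).argmax(axis=1)
    ```

    With a k-NN graph each row of S has only about k non-zero entries, so one iteration costs roughly O(nk) rather than O(n^2), which is what makes hundreds of thousands of training examples feasible.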
  • A novel two stage scheme utilizing the test set for model selection in text classification

    Pfahringer, Bernhard; Reutemann, Peter; Mayo, Michael (2005)

    Conference item
    University of Waikato

    Text classification is a natural application domain for semi-supervised learning, as labeling documents is expensive while an abundance of unlabeled documents is usually available. We describe a novel, simple two-stage scheme based on dagging which allows for utilizing the test set in model selection. The dagging ensemble can also be used by itself instead of the original classifier. We evaluate the performance of a meta classifier choosing between various base learners and their respective dagging ensembles. The selection process seems to perform robustly, especially for small percentages of available labels for training.

    View record details
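
    A rough sketch of dagging itself, the building block of the two-stage scheme above: train one copy of the base learner per disjoint fold of the labeled data and combine them by majority vote. Details are my assumptions, and the model-selection stage that uses the test set is not shown.

    ```python
    # Illustrative sketch only: dagging (disjoint-sample aggregating).
    import numpy as np
    from sklearn.base import clone

    def dagging_fit(base_learner, X, y, n_folds=10, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, n_folds)
        return [clone(base_learner).fit(X[f], y[f]) for f in folds]

    def dagging_predict(models, X):
        votes = np.stack([m.predict(X) for m in models])  # assumes integer class labels
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    ```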
  • Millions of random rules

    Pfahringer, Bernhard; Holmes, Geoffrey; Wang, Cheng (2004)

    Conference item
    University of Waikato

    In this paper we report on work in progress based on the induction of vast numbers of almost random rules. This work tries to combine and explore ideas from both Random Forests and Stochastic Discrimination. We describe a fast algorithm for generating almost random rules and study its performance. Rules are generated in such a way that all training examples are covered by roughly the same number of rules each. Rules themselves usually have a clear majority class among the examples they cover, but they are not limited in terms of either minimal coverage or minimal purity. A preliminary experimental evaluation indicates really promising results for both predictive accuracy and speed of induction, but at the expense of both large memory consumption and slow prediction. Finally, we discuss various directions for our future research.

    View record details
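
    A hedged sketch of the coverage-balancing idea mentioned above (every training example covered by roughly the same number of rules). It assumes a hypothetical helper random_rule_covering(example) that builds a random rule guaranteed to cover a given seed example; this is not the paper's algorithm.

    ```python
    # Illustrative sketch only: seed each new rule from the currently least-covered example.
    import numpy as np

    def generate_balanced_rules(X, n_rules, random_rule_covering):
        coverage = np.zeros(len(X), dtype=int)
        rules = []
        for _ in range(n_rules):
            seed = int(np.argmin(coverage))       # least-covered example so far
            rule = random_rule_covering(X[seed])  # hypothetical rule generator
            coverage += np.array([rule(x) for x in X])
            rules.append(rule)
        return rules
    ```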
  • Propositionalization through stochastic discrimination

    Pfahringer, Bernhard; Holmes, Geoffrey (2003)

    Conference item
    University of Waikato

    A simple algorithm based on the theory of stochastic discrimination is developed for the fast extraction of sub-graphs with potential discriminative power from a given set of pre-classified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed, as well as directions for future research.

    View record details
  • (The Futility of) Trying to Predict Carcinogenicity of Chemical Compounds

    Pfahringer, Bernhard (2001)

    Conference item
    University of Waikato

    This paper describes my submission to one of the sub-problems formulated for the Predictive Toxicology Challenge 2001. The challenge is to predict the carcinogenicity of chemicals based on structural information only. I have only tackled such predictions for bioassays involving male rats. As we currently do not know the true predictions for the test set, all we can say is that one of the models supplied by us seems to be optimal over some subrange of the ROC spectrum. The successful model uses a voting approach based on most of the sets of structural features made available by various other contestants as well as the organizers in an earlier phase of the Challenge. The WEKA Machine Learning workbench served as the core learning utility. Based on a preliminary examination of our submission we conclude that reliable prediction of carcinogenicity is still a distant goal.

    View record details
  • Wrapping boosters against noise

    Pfahringer, Bernhard; Holmes, Geoffrey; Schmidberger, Gabi (2001)

    Conference item
    University of Waikato

    Wrappers have recently been used to obtain parameter optimizations for learning algorithms. In this paper we investigate the use of a wrapper for estimating the correct number of boosting ensembles in the presence of class noise. Contrary to the naive approach, which would be quadratic in the number of boosting iterations, the incremental algorithm described is linear. Additionally, directly using the k-sized ensembles generated during the k-fold cross-validation search for prediction usually results in further improvements in classification performance. This improvement can be attributed to the reduction of variance due to averaging k ensembles instead of using only one ensemble. Consequently, cross-validation in the way we use it here, termed wrapping, can be viewed as yet another ensemble learner, similar in spirit to bagging but also somewhat related to stacking.

    View record details
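
    A hedged sketch of the linear "wrapping" search described above, using scikit-learn's gradient boosting and its staged predictions as a stand-in booster (the paper's learner and setting may differ): each fold's ensemble is grown once to the maximum size, the per-iteration fold errors choose the ensemble size, and the k fold ensembles can then be kept and averaged for prediction.

    ```python
    # Illustrative sketch only: choosing the boosting ensemble size with one
    # linear pass of k-fold cross-validation over staged predictions.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import KFold

    def wrap_boosting(X, y, max_iters=200, k=10):
        fold_models, fold_errors = [], []
        for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model = GradientBoostingClassifier(n_estimators=max_iters).fit(X[train], y[train])
            errs = [np.mean(p != y[test]) for p in model.staged_predict(X[test])]
            fold_models.append(model)
            fold_errors.append(errs)
        best_size = int(np.argmin(np.mean(fold_errors, axis=0))) + 1
        # Keeping all k fold ensembles (truncated to best_size) and averaging their
        # votes gives the extra variance reduction the abstract attributes to wrapping.
        return fold_models, best_size
    ```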
  • Full model selection in the space of data mining operators

    Sun, Quan; Pfahringer, Bernhard; Mayo, Michael (2012)

    Conference item
    University of Waikato

    We propose a framework and a novel algorithm for the full model selection (FMS) problem. The proposed algorithm, combining both genetic algorithms (GA) and particle swarm optimization (PSO), is named GPS (which stands for GAPSO-FMS), in which a GA is used for searching the optimal structure of a data mining solution, and PSO is used for searching the optimal parameter set for a particular structure instance. Given a classification or regression problem, GPS outputs an FMS solution as a directed acyclic graph consisting of diverse data mining operators that are applicable to the problem, including data cleansing, data sampling, feature transformation/selection and algorithm operators. The solution can also be represented graphically in a human-readable form. Experimental results demonstrate the benefit of the algorithm.

    View record details
  • Semi-random model tree ensembles: An effective and scalable regression method

    Pfahringer, Bernhard (2011)

    Conference item
    University of Waikato

    We present and investigate ensembles of semi-random model trees as a novel regression method. Such ensembles combine the scalability of tree-based methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that Semi-Random Model Trees produce predictive performance which is competitive with state-of-the-art methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of Random Model Trees scale better than Gaussian Processes Regression to larger datasets, and enjoy a constant advantage over Additive Groves of one to two orders of magnitude.

    View record details