5,684 results for Conference item

  • Mining Arbitrarily Large Datasets Using Heuristic k-Nearest Neighbour Search

    Wu, Xing; Holmes, Geoffrey; Pfahringer, Bernhard (2008)

    Conference item
    University of Waikato

    Nearest Neighbour Search (NNS) is one of the top ten data mining algorithms. It is simple and effective but has a time complexity that is the product of the number of instances and the number of dimensions. When the number of dimensions is greater than two there are no known solutions that can guarantee a sublinear retrieval time. This paper describes and evaluates two ways to make NNS efficient for datasets that are arbitrarily large in the number of instances and dimensions. The methods are best described as heuristic as they are neither exact nor approximate. Both stem from recent developments in the field of data stream classification. The first uses Hoeffding Trees, an extension of decision trees to streams and the second is a direct stream extension of NNS. The methods are evaluated in terms of their accuracy and the time taken to find the neighbours. Results show that the methods are competitive with NNS in terms of accuracy but significantly faster.

    View record details
  • Tie-breaking in Hoeffding trees

    Holmes, Geoffrey; Richard, Kirkby; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    A thorough examination of the performance of Hoeffding trees, state-of-the-art in classification for data streams, on a range of datasets reveals that tie breaking, an essential but supposedly rare procedure, is employed much more than expected. Testing with a lightweight method for handling continuous attributes, we find that the excessive invocation of tie breaking causes performance to degrade significantly on complex and noisy data. Investigating ways to reduce the number of tie breaks, we propose an adaptive method that overcomes the problem while not significantly affecting performance on simpler datasets.

    View record details
  • Data mining challenge problems: any lessons learned?

    Pfahringer, Bernhard (2002)

    Conference item
    University of Waikato

    When considering the merit of data mining challenges, we need to answer the question of whether the amount of academic outcome justifies the related expense of scarce research time. In this paper I will provide anecdotal evidence for what I have learned personally from participating a various challenges. Building on that I will suggest a format for challenges that tries to increase the amount and improve the quality of the academic output of such challenges for the mahine learning research community.

    View record details
  • Organizing the World’s Machine Learning Information

    Vanschoren, Joaquin; Blockeel, Hendrik; Pfahringer, Bernhard; Holmes, Geoffrey (2009)

    Conference item
    University of Waikato

    All around the globe, thousands of learning experiments are being executed on a daily basis, only to be discarded after interpretation. Yet, the information contained in these experiments might have uses beyond their original intent and, if properly stored, could be of great use to future research. In this paper, we hope to stimulate the development of such learning experiment repositories by providing a bird’s-eye view of how they can be created and used in practice, bringing together existing approaches and new ideas. We draw parallels between how experiments are being curated in other sciences, and consecutively discuss how both the empirical and theoretical details of learning experiments can be expressed, organized and made universally accessible. Finally, we discuss a range of possible services such a resource can offer, either used directly or integrated into data mining tools.

    View record details
  • A semi-supervised spam mail detector

    Pfahringer, Bernhard (2006)

    Conference item
    University of Waikato

    This document describes a novel semi-supervised approach to spam classification, which was successful at the ECML/PKDD 2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semi-supervised learning algorithm.

    View record details
  • Stress- testing Hoeffding trees

    Holmes, Geoffrey; Kirkby, Richard Brendon; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    Hoeffding trees are state-of-the-art in classification for data streams. They perform prediction by choosing the majority class at each leaf. Their predictive accuracy can be increased by adding Naive Bayes models at the leaves of the trees. By stress-testing these two prediction methods using noise and more complex concepts and an order of magnitude more instances than in previous studies, we discover situations where the Naive Bayes method outperforms the standard Hoeffding tree initially but is eventually overtaken. The reason for this crossover is determined and a hybrid adaptive method is proposed that generally outperforms the two original prediction methods for both simple and complex concepts as well as under noise.

    View record details
  • Learning to use operational advice

    Fürnkranz, Johannes; Pfahringer, Bernhard; Kaindl, Hermann; Kramer, Stefan (2000)

    Conference item
    University of Waikato

    We address the problem of advice-taking in a given domain, in particular for building a game-playing program. Our approach to solving it strives for the application of machine learning techniques throughout, i.e. for avoiding knowledge elicitation by any other means as much as possible. In particular, we build upon existing work on the operationalization of advice by machine and assume that advice is already available in operational form. The relative importance of this advice is, however, not yet known can therefore not be utilized well by a program. This paper presents an approach to determine the relative importance for a given situation through reinforcement learning. We implemented this approach for the game of Hearts and gathered some empirical evidence on its usefulness through experiments. The results show that the programs built according to our approach learned to make good use of the given operational advice.

    View record details
  • Clustering large datasets using cobweb and K-means in tandem

    Li, Mi; Holmes, Geoffrey; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    This paper presents a single scan algorithm for clustering large datasets based on a two phase process which combines two well known clustering methods. The Cobweb algorithm is modified to produce a balanced tree with subclusters at the leaves, and then K-means is applied to the resulting subclusters. The resulting method, Scalable Cobweb, is then compared to a single pass K-means algorithm and standard K-means. The evaluation looks at error as measured by the sum of squared error and vulnerability to the order in which data points are processed.

    View record details
  • Text categorisation using document profiling

    Sauban, Maximilien; Pfahringer, Bernhard (2003)

    Conference item
    University of Waikato

    This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee s model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.

    View record details
  • (The Futility of) Trying to Predict Carcinogenicity of Chemical Compounds

    Pfahringer, Bernhard (2001)

    Conference item
    University of Waikato

    This paper describes my submission to one of the sub-problems formulated for the Predictive Toxicology Challenge 2001. The challenge is to predict the carcinogenicity of chemicals based on structural information only. I have only tackled such predictions for bioassays involving male rats. As we currently do not know the true predictions for the test-set, all we can say is that one of the models supplied by us seems to be optimal over some subrange of the ROC spectrum. The successful model uses a voting approach based on most of the sets of structural features made available by various other contestants as well as the organizers in an earlier phase of the Challenge. The WEKA Machine Learning workbench served as the core learning utility. Based on a preliminary examination of our submission we conclude that reliable prediction of carcinogenicity is still a far away goal.

    View record details
  • Wrapping boosters against noise

    Pfahringer, Bernhard; Holmes, Geoffrey; Schmidberger, Gabi (2001)

    Conference item
    University of Waikato

    Wrappers have recently been used to obtain parameter optimizations for learning algorithms. In this paper we investigate the use of a wrapper for estimating the correct number of boosting ensembles in the presence of class noise. Contrary to the naive approach that would be quadratic in the number of boosting iterations, the incremental algorithm described is linear. Additionally, directly using the k-sized ensembles generated during k-fold cross-validation search for prediction usually results in further improvements in classification performance. This improvement can be attributed to the reduction of variance due to averaging k ensembles instead of using only one ensemble. Consequently, cross-validation in the way we use it here, termed wrapping, can be viewed as yet another ensemble learner similar in spirit to bagging but also somewhat related to stacking.

    View record details
  • Exploiting propositionalization based on random relational rules for semi-supervised learning

    Pfahringer, Bernhard; Anderson, Grant (2008)

    Conference item
    University of Waikato

    In this paper we investigate an approach to semi-supervised learning based on randomized propositionalization, which allows for applying standard propositional classification algorithms like support vector machines to multi-relational data. Randomization based on random relational rules can work both with and without a class attribute and can therefore be applied simultaneously to both the labeled and the unlabeled portion of the data present in semi-supervised learning. An empirical investigation compares semi-supervised propositionalization to standard propositionalization using just the labeled data portion, as well as to a variant that also just uses the labeled data portion but includes the label information in an attempt to improve the resulting propositionalization. Preliminary experimental results indicate that propositionalization generated on the full dataset, i.e. the semi- supervised approach, tends to outperform the other two more standard approaches.

    View record details
  • A Toolbox for Learning from Relational Data with Propositional and Multi-instance Learners

    Reutemann, Peter; Pfahringer, Bernhard; Frank, Eibe (2005)

    Conference item
    University of Waikato

    Most databases employ the relational model for data storage. To use this data in a propositional learner, a propositionalization step has to take place. Similarly, the data has to be transformed to be amenable to a multi-instance learner. The Proper Toolbox contains an extended version of RELAGGS, the Multi-Instance Learning Kit MILK, and can also combine the multi-instance data with aggregated data from RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to work with both primary keys and indices. For MILK the relational model is flattened into a single table and this data is fed into a multi-instance learner. REMILK finally combines the aggregated data produced by RELAGGS and the multi-instance data, flattened for MILK, into a single table that is once again the input for a multi-instance learner. Several well-known datasets are used for experiments which highlight the strengths and weaknesses of the different approaches.

    View record details
  • Using the online cross-entropy method to learn relational policies for playing different games

    Sarjant, Samuel; Pfahringer, Bernhard; Driessens, Kurt; Smith, Tony C. (2011)

    Conference item
    University of Waikato

    By defining a video-game environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, human-readable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online cross-entropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. Pac-Man and Mario environments, with results indicating the agent learns an effective policy for acting within each environment.

    View record details
  • Clustering Relational Data Based on Randomized Propositionalization

    Anderson, Grant; Pfahringer, Bernhard (2008)

    Conference item
    University of Waikato

    Clustering of relational data has so far received a lot less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows for applying standard clustering algorithms like KMeans to multi-relational data. We describe how random rules are generated and then turned into Boolean-valued features. Clustering generally is not straightforward to evaluate, but preliminary experimental results on a number of standard ILP datasets show promising results. Clusters generated without class information usually agree well with the true class labels of cluster members, i.e. class distributions inside clusters generally differ significantly from the global class distributions. The two-tiered algorithm described shows good scalability due to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.

    View record details
  • Propositionalization through stochastic discrimination

    Pfahringer, Bernhard; Holmes, Geoffrey (2003)

    Conference item
    University of Waikato

    A Simple algorithm base on the theory of stochastic discrimination is developed for the fast extraction of sub-graphs with potential discriminative power from a given set of pre-classified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed as well as directions for future research.

    View record details
  • Using weighted nearest neighbor to benefit from unlabeled data

    Driessens, Kurt; Reutemann, Peter; Pfahringer, Bernhard; Leschi, Claire (2006)

    Conference item
    University of Waikato

    The development of data-mining applications such as textclassification and molecular profiling has shown the need for machine learning algorithms that can benefit from both labeled and unlabeled data, where often the unlabeled examples greatly outnumber the labeled examples. In this paper we present a two-stage classifier that improves its predictive accuracy by making use of the available unlabeled data. It uses a weighted nearest neighbor classification algorithm using the combined example-sets as a knowledge base. The examples from the unlabeled set are “pre-labeled” by an initial classifier that is build using the limited available training data. By choosing appropriate weights for this pre-labeled data, the nearest neighbor classifier consistently improves on the original classifier.

    View record details
  • Propositionalisation of multiple sequence alignments using probabilistic models

    Mutter, Stefan; Pfahringer, Bernhard; Holmes, Geoffrey (2008)

    Conference item
    University of Waikato

    Multiple sequence alignments play a central role in Bioinformatics. Most alignment representations are designed to facilitate knowledge extraction by human experts. Additionally statistical models like Profile Hidden Markov Models are used as representations. They offer the advantage to provide sound, probabilistic scores. The basic idea we present in this paper is to use the structure of a Profile Hidden Markov Model for propositionalisation. This way we get a simple, extendable representation of multiple sequence alignments which facilitates further analysis by Machine Learning algorighms.

    View record details
  • SMOTE for regression

    Torgo, Luís; Ribeiro, Rita P.; Pfahringer, Bernhard; Branco, Paula (2013)

    Conference item
    University of Waikato

    Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that was already studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by sampling approaches. These approaches change the distribution of the given training data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present a modification of the well-known Smote algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable.

    View record details
  • Pitfalls in benchmarking data stream classification and how to avoid them

    Bifet, Albert; Read, Jesse; Žliobaitė, Indrė; Pfahringer, Bernhard; Holmes, Geoffrey (2013)

    Conference item
    University of Waikato

    Data stream classification plays an important role in modern data analysis, where data arrives in a stream and needs to be mined in real time. In the data stream setting the underlying distribution from which this data comes may be changing and evolving, and so classifiers that can update themselves during operation are becoming the state-of-the-art. In this paper we show that data streams may have an important temporal component, which currently is not considered in the evaluation and benchmarking of data stream classifiers. We demonstrate how a naive classifier considering the temporal component only outperforms a lot of current state-of-the-art classifiers on real data streams that have temporal dependence, i.e. data is autocorrelated. We propose to evaluate data stream classifiers taking into account temporal dependence, and introduce a new evaluation measure, which provides a more accurate gauge of data stream classifier performance. In response to the temporal dependence issue we propose a generic wrapper for data stream classifiers, which incorporates the temporal component into the attribute space.

    View record details