182 results for Witten, Ian H.

  • StoneD: A bridge between Greenstone and DSpace

    Witten, Ian H.; Bainbridge, David; Tansley, Robert; Huang, Chi-Yu; Don, Katherine J. (2005-04)

    Working or discussion paper
    University of Waikato

    Greenstone and DSpace are widely-used software systems for digital libraries, and prospective users sometimes wonder which one to adopt. In fact, the aims of the two are very different, although their domains of application do overlap. This paper describes the systems and identifies their similarities and differences. We also present StoneD, a software bridge between the production versions of Greenstone and DSpace that allows users of either system to easily migrate to the other, or continue with a combination of both. This bridge eliminates the risk of finding oneself locked in to an inappropriate choice of system. We also discuss other possible opportunities for combining the advantages of the two, to the benefit of the user communities of both systems.

    View record details
  • Displaying 3D images: algorithms for single-image random-dot stereograms

    Thimbleby, Harold W.; Inglis, Stuart J.; Witten, Ian H. (1994-10-01)

    Journal article
    University of Waikato

    A new, simple, and symmetric algorithm produces higher levels of detail in solid objects than was previously possible with autostereograms. In a stereoscope, an optical instrument similar to binoculars, each eye views a different picture and thereby receives the specific image that would have arisen naturally. An early suggestion for a color stereo computer display involved a rotating filter wheel held in front of the eyes. In contrast, this article describes a method for viewing on paper or on an ordinary computer screen without special equipment, although it is limited to the display of 3D monochromatic objects. (The image can be colored, say, for artistic reasons, but the method we describe does not allow colors to be allocated in a way that corresponds to an arbitrary coloring of the solid object depicted.) The image can easily be constructed by computer from any 3D scene or solid object description.

    View record details
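
    The abstract above describes constructing an autostereogram directly from a 3D description. A minimal sketch of the core idea, not the paper's exact algorithm: pairs of pixels that the two diverged eyes should fuse are constrained to be equal, and the separation between linked pixels shrinks as the depicted point comes nearer. The function name, the eye separation default, and the 50% depth scaling are all illustrative choices.

    ```python
    import random

    def autostereogram(depth, width, eye_sep=60):
        """Generate a single-image random-dot stereogram from a depth map.

        depth: 2D list of floats in [0, 1] (0 = far plane, 1 = near).
        Returns a 2D list of 0/1 pixels; viewed with diverged eyes,
        nearer points appear to float in front of the background.
        """
        rows = []
        for depth_row in depth:
            same = list(range(width))      # same[x]: pixel x must equal pixel same[x]
            for x in range(width):
                # separation shrinks as the point comes nearer to the viewer
                sep = int(eye_sep * (1 - 0.5 * depth_row[x]))
                left = x - sep // 2
                right = left + sep
                if 0 <= left and right < width:
                    same[right] = left     # constrain the linked pair to match
            row = [0] * width
            for x in range(width):         # same[x] <= x, so it is already filled
                row[x] = row[same[x]] if same[x] != x else random.randint(0, 1)
            rows.append(row)
        return rows
    ```

    Because each row is processed independently and unconstrained pixels are random, the visible texture carries no trace of the hidden surface; only the repetition period changes with depth.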
  • Using compression to identify acronyms in text

    Yeates, Stuart Andrew; Bainbridge, David; Witten, Ian H. (2000-01)

    Working or discussion paper
    University of Waikato

    Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens - names, dates, locations, etc. - can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types.

    View record details
  • Text categorization using compression models

    Frank, Eibe; Chui, Chang; Witten, Ian H. (2000-01)

    Working or discussion paper
    University of Waikato

    Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles which effectively define the categories are used as “training data” to build a model that can be used for classifying new articles that comprise the “test data”. This contrasts with “unsupervised” learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

    View record details
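
    The categorization scheme the abstract describes can be sketched very compactly: build one compression model per category from its training articles, then assign a new article to the category whose model compresses it best. This sketch uses off-the-shelf zlib as a stand-in for the character-level compression models the paper studies; the function names are illustrative.

    ```python
    import zlib

    def categorize(doc, training):
        """Assign doc to the category whose compression model fits it best.

        training: dict mapping category name -> concatenated training text.
        The score is the extra bytes needed to compress the document after
        priming the compressor with the category's training text.
        """
        def extra_bytes(context, text):
            base = len(zlib.compress(context.encode()))
            both = len(zlib.compress((context + " " + text).encode()))
            return both - base

        return min(training, key=lambda c: extra_bytes(training[c], doc))
    ```

    A document sharing vocabulary with a category's training text yields cheap back-references and hence few extra bytes, which is the compression analogue of a high class-conditional probability.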
  • KEA: Practical automatic keyphrase extraction

    Witten, Ian H.; Paynter, Gordon W.; Frank, Eibe; Gutwin, Carl; Nevill-Manning, Craig G. (2000-03)

    Working or discussion paper
    University of Waikato

    Keyphrases provide semantic metadata that summarizes and characterizes documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

    View record details
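
    The pipeline in the abstract (lexical candidate extraction, feature calculation, learned prediction) can be sketched as follows. This is a simplification under stated assumptions: the stopword list is a tiny illustrative one, and Kea's Naive Bayes model is replaced by a fixed combination of its two classic features, TF×IDF and first-occurrence position. All names are mine, not the paper's.

    ```python
    import math
    import re
    from collections import Counter

    def candidate_phrases(text, max_len=3):
        """Candidate keyphrases: word n-grams that neither start nor end
        with a stopword (a simplification of Kea's lexical filtering).
        Returns (phrase, normalized position of this occurrence) pairs."""
        stop = {"the", "a", "of", "and", "in", "to", "is", "for", "on"}
        words = re.findall(r"[a-z']+", text.lower())
        out = []
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] not in stop and gram[-1] not in stop:
                    out.append((" ".join(gram), i / max(1, len(words))))
        return out

    def score_candidates(doc, corpus):
        """Rank candidates by TFxIDF weighted by first-occurrence position,
        the two features Kea feeds to its learned model."""
        doc_freq = Counter()
        for other in corpus:
            doc_freq.update({p for p, _ in candidate_phrases(other)})
        cands = candidate_phrases(doc)
        counts = Counter(p for p, _ in cands)
        first = {}
        for p, pos in cands:
            first.setdefault(p, pos)       # earliest occurrence wins
        scores = {p: counts[p]
                     * math.log(len(corpus) / (1 + doc_freq[p]))
                     * (1 - first[p])
                  for p in counts}
        return sorted(scores, key=scores.get, reverse=True)
    ```

    Phrases that are frequent in the document, rare in the rest of the corpus, and appear early rank highest, which is the intuition behind Kea's feature set.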
  • Automating iterative tasks with programming by demonstration: a user evaluation

    Paynter, Gordon W.; Witten, Ian H. (1999-05)

    Working or discussion paper
    University of Waikato

    Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely-used, applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks. Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.

    View record details
  • Weka: Practical machine learning tools and techniques with Java implementations

    Witten, Ian H.; Frank, Eibe; Trigg, Leonard E.; Hall, Mark A.; Holmes, Geoffrey; Cunningham, Sally Jo (1999-08)

    Working or discussion paper
    University of Waikato

    The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

    View record details
  • Generating accurate rule sets without global optimization

    Frank, Eibe; Witten, Ian H. (1998-01)

    Working or discussion paper
    University of Waikato

    The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees, and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER’s. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.

    View record details
  • Browsing in digital libraries: a phrase-based approach

    Nevill-Manning, Craig G.; Witten, Ian H.; Paynter, Gordon W. (1997-01)

    Working or discussion paper
    University of Waikato

    A key question for digital libraries is this: how should one go about becoming familiar with a digital collection, as opposed to a physical one? Digital collections generally present an appearance which is extremely opaque: a screen, typically a Web page, with no indication of what, or how much, lies beyond: whether a carefully-selected collection or a morass of worthless ephemera; whether half a dozen documents or many millions. At least physical collections occupy physical space, present a physical appearance, and exhibit tangible physical organization. When standing on the threshold of a large library one gains a sense of presence and permanence that reflects the care taken in building and maintaining the collection inside. No-one could confuse it with a dung-heap! Yet in the digital world the difference is not so palpable.

    View record details
  • Stacking bagged and dagged models

    Ting, Kai Ming; Witten, Ian H. (1997-03)

    Working or discussion paper
    University of Waikato

    In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context: when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking. We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than conventional bagging.

    View record details
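
    The two sampling regimes the abstract contrasts are easy to pin down in code. A minimal sketch (function names are mine): bagging draws bootstrap samples with replacement, dagging partitions a shuffled copy of the data into disjoint subsets; in either case a model is trained on each subset, and the stacked variants then replace majority vote with a learned combiner.

    ```python
    import random

    def bag(data, k):
        """Bagging's regime: k bootstrap samples, each drawn with
        replacement and the same size as the original dataset."""
        return [[random.choice(data) for _ in data] for _ in range(k)]

    def dag(data, k):
        """Dagging's regime: shuffle once, then split into k disjoint
        subsets so no training example is shared between models."""
        shuffled = random.sample(data, len(data))
        return [shuffled[i::k] for i in range(k)]
    ```

    With bagging, roughly a third of the examples are absent from any given sample (and others repeated); with dagging, every example appears exactly once across the subsets, so each model sees strictly less data.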
  • Using model trees for classification

    Frank, Eibe; Wang, Yong; Inglis, Stuart J.; Holmes, Geoffrey; Witten, Ian H. (1997-04)

    Working or discussion paper
    University of Waikato

    Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5’, based on Quinlan’s M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

    View record details
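
    The "standard method of transforming a classification problem into a problem of function approximation" that the abstract mentions can be shown concretely: one 0/1 indicator target per class, one regressor per class, and the predicted class is the one whose regressor outputs the largest value. In this sketch a least-squares line on a single numeric attribute stands in for the M5' model trees the paper uses; the function names are illustrative.

    ```python
    def fit_linear(xs, ys):
        """Least-squares line through (xs, ys); a stand-in for the
        model-tree regressor used in the paper."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = sum((x - mx) ** 2 for x in xs) or 1.0
        slope = cov / var
        return lambda x: slope * (x - mx) + my

    def classify_via_regression(train_x, train_y, classes, query):
        """The classification-to-regression transformation: fit one
        regressor per class on 0/1 class-membership targets, then
        predict the class with the largest regressed 'membership'."""
        models = {c: fit_linear(train_x,
                                [1.0 if y == c else 0.0 for y in train_y])
                  for c in classes}
        return max(classes, key=lambda c: models[c](query))
    ```

    The regressed values approximate class-membership probabilities, which is why a stronger regressor (such as a model tree) translates directly into a stronger classifier.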
  • Interactive concept learning for end-user applications

    Maulsby, David; Witten, Ian H. (1995-02)

    Working or discussion paper
    University of Waikato

    Personalizable software agents will learn new tasks from their users. This implies being able to learn from instructions users might give: examples, yes/no responses, and ambiguous, incomplete hints. Agents should also exploit background knowledge customized for applications such as drawing, word processing and form-filling. The task models that agents learn describe data, actions and their context. Learning about data from examples and hints is the subject of this paper. The Cima learning system combines evidence from examples, task knowledge and user hints to form Disjunctive Normal Form (DNF) rules for classifying, generating or modifying data. Cima's dynamic bias manager generates candidate features (attribute values, functions or relations), from which its DNF learning algorithm selects relevant features and forms the rules. The algorithm is based on a classic greedy method, with two enhancements. First, the standard learning criterion, correct classification, is augmented with a set of utility and instructional criteria. Utility criteria ensure that descriptions are properly formed for use in actions, whether to classify, search for, generate or modify data. Instructional criteria ensure that descriptions include features that users suggest and avoid those that users reject. The second enhancement is to augment the usual statistical metric for selecting relevant attributes with a set of heuristics, including beliefs based on user suggestions and application-specific background knowledge. Using multiple heuristics increases the justification for selecting features; more important, it helps the learner choose among alternative interpretations of hints. When tested on dialogues observed in a prior user study on a simulated interface agent, the learning algorithm achieves 95% of the learning efficiency standard established in that study.

    View record details
  • A New Zealand digital library for computer science research

    Witten, Ian H.; Cunningham, Sally Jo; Vallabh, Mahendra; Bell, Timothy C. (1995-03)

    Working or discussion paper
    University of Waikato

    A large amount of computing literature has become available over the Internet, as university departments and research institutions have made their technical reports, preprints, and theses available electronically. Access to these items has been limited, however, by the difficulties involved in locating documents of interest. We describe a proposal for a New Zealand-based index of computer science technical reports, where the reports themselves are located in repositories that are distributed world-wide. Our scheme is unique in that it is based on indexing the full text of the technical reports, rather than on document surrogates. The index is constructed so as to minimize network traffic and local storage costs (of particular importance for geographically isolated countries like New Zealand, which incur high Internet costs). We will also provide support for bibliometric/scientometric studies of the computing literature and of our users.

    View record details
  • Machine learning in practice: experience with agricultural databases

    Garner, Stephen R.; Cunningham, Sally Jo; Holmes, Geoffrey; Nevill-Manning, Craig G.; Witten, Ian H. (1995-05)

    Working or discussion paper
    University of Waikato

    The Waikato Environment for Knowledge Analysis (weka) is a New Zealand government-sponsored initiative to investigate the application of machine learning to economically important problems in the agricultural industries. The overall goals are to create a workbench for machine learning, determine the factors that contribute towards its successful application in the agricultural industries, and develop new methods of machine learning and ways of assessing their effectiveness. The project began in 1993 and is currently working towards the fulfilment of three objectives: to design and implement the workbench, to provide case studies of applications of machine learning techniques to problems in agriculture, and to develop a methodology for evaluating generalisations in terms of their entropy. These three objectives are by no means independent. For example, the design of the weka workbench has been inspired by the demands placed on it by the case studies, and has also benefited from our work on evaluating the outcomes of applying a technique to data. Our experience throughout the development of the project is that the successful application of machine learning involves much more than merely executing a learning algorithm on some data. In this paper we present the process model that underpins our work over the past two years for the development of applications in agriculture; the software we have developed around our workbench of machine learning schemes to support this model; and the outcomes and problems we have encountered in developing applications.

    View record details
  • The development of Holte's 1R Classifier

    Nevill-Manning, Craig G.; Holmes, Geoffrey; Witten, Ian H. (1995-06)

    Working or discussion paper
    University of Waikato

    The 1R procedure for machine learning is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two areas that can be improved: the way that intervals are formed when discretizing continuously-valued attributes, and the way that missing values are treated. Then we show how the algorithm can be extended to avoid a problem endemic to most practical machine learning algorithms—their frequent dismissal of an attribute as irrelevant when in fact it is highly relevant when combined with other attributes.

    View record details
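
    The 1R procedure the abstract builds on is simple enough to state in full: for every attribute, form a one-level rule mapping each observed value to its majority class, then keep the single attribute whose rule makes the fewest training errors. A minimal sketch for nominal attributes (the paper's improvements to discretization and missing-value handling are not shown; names are mine):

    ```python
    from collections import Counter, defaultdict

    def one_r(instances, labels):
        """Holte's 1R on nominal attributes: return (attribute index,
        value -> class rule) for the single best attribute."""
        best = None
        for a in range(len(instances[0])):
            by_value = defaultdict(Counter)
            for inst, label in zip(instances, labels):
                by_value[inst[a]][label] += 1
            # majority class for each value of this attribute
            rule = {v: counts.most_common(1)[0][0]
                    for v, counts in by_value.items()}
            errors = sum(rule[inst[a]] != label
                         for inst, label in zip(instances, labels))
            if best is None or errors < best[0]:
                best = (errors, a, rule)
        return best[1], best[2]
    ```

    Because the rule inspects one attribute in isolation, it exhibits exactly the weakness the paper addresses: an attribute that is only predictive in combination with others will be dismissed.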
  • Signal processing for melody transcription

    McNab, Rodger J.; Smith, Lloyd A.; Witten, Ian H. (1995-08)

    Working or discussion paper
    University of Waikato

    MT is a melody transcription system that accepts acoustic input, typically sung by the user, and displays it in standard music notation. It tracks the pitch of the input and segments the pitch stream into musical notes, which are labelled by their pitches relative to a reference frequency that adapts to the user's tuning. This paper describes the signal processing operations involved, and discusses two applications that have been prototyped: a sightsinging tutor and a scheme for acoustically indexing a melody database.

    View record details
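
    The labelling step the abstract describes, assigning notes relative to a reference frequency that adapts to the user's tuning, can be sketched with the standard frequency-to-semitone conversion. The adaptation rate `alpha` and the function name are illustrative assumptions, not the MT system's actual parameters.

    ```python
    import math

    def label_notes(freqs, ref=440.0, alpha=0.1):
        """Label a stream of pitch estimates (Hz) as MIDI note numbers,
        adapting the reference frequency to the singer's tuning."""
        notes = []
        for f in freqs:
            semitones = 12 * math.log2(f / ref)   # distance from ref in semitones
            note = round(semitones)
            notes.append(69 + note)               # 69 = MIDI A4 at the reference
            # nudge the reference toward the singer's actual tuning:
            # the fractional semitone error becomes a frequency correction
            error = semitones - note
            ref *= 2 ** (alpha * error / 12)
        return notes
    ```

    A consistently flat or sharp singer pulls the reference toward their own tuning, so note labels stay stable instead of flipping between neighbouring semitones.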
  • Building a public digital library based on full-text retrieval

    Witten, Ian H.; Nevill-Manning, Craig G.; Cunningham, Sally Jo (1995-08)

    Working or discussion paper
    University of Waikato

    Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library, a reliable, high-quality community service, has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second obstacle is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation. These constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue becomes the question of obtaining both plain text for indexing, and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents, such as page-based indexing and viewing, without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes the problem. This paper describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface of a prototype public digital library.

    View record details
  • Towards the digital music library: tune retrieval from acoustic input

    McNab, Rodger J.; Smith, Lloyd A.; Witten, Ian H.; Henderson, Clare L.; Cunningham, Sally Jo (1995-10)

    Working or discussion paper
    University of Waikato

    Music is traditionally retrieved by title, composer or subject classification. It is possible, with current technology, to retrieve music from a database on the basis of a few notes sung or hummed into a microphone. This paper describes the implementation of such a system, and discusses several issues pertaining to music retrieval. We first describe an interface that transcribes acoustic input into standard music notation. We then analyze string matching requirements from ranked retrieval of music and present the results of an experiment which tests how accurately people sing well known melodies. The performance of several string matching criteria are analyzed using two folk song databases. Finally, we describe a prototype system which has been developed for retrieval of tunes from acoustic input.

    View record details
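
    One of the string-matching criteria the abstract analyzes, approximate matching of melodies, can be sketched by representing each tune as its sequence of semitone intervals (which makes matching transposition-invariant) and ranking database tunes by edit distance to the sung query. This is a generic illustration of the idea, not the paper's exact matching criteria; all names are mine.

    ```python
    def intervals(pitches):
        """Contour representation: semitone steps between successive
        notes, so a transposed tune maps to the same sequence."""
        return [b - a for a, b in zip(pitches, pitches[1:])]

    def edit_distance(a, b):
        """Classic dynamic-programming edit distance, one row at a time;
        'prev' carries the diagonal cell of the previous row."""
        d = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, y in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                       d[j - 1] + 1,    # insertion
                                       prev + (x != y)) # substitution
        return d[-1]

    def rank_tunes(query, database):
        """Rank (name, pitch list) tunes by interval edit distance
        to the sung query's pitch list."""
        q = intervals(query)
        return sorted(database,
                      key=lambda tune: edit_distance(q, intervals(tune[1])))
    ```

    Because only intervals are compared, a query sung a fifth higher than the stored tune still ranks it first; edit distance then absorbs the occasional wrong or missing note that singers produce.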
  • WEKA: a machine learning workbench

    Holmes, Geoffrey; Donkin, Andrew; Witten, Ian H. (1994-07)

    Working or discussion paper
    University of Waikato

    Weka is a workbench for machine learning that is intended to aid in the application of machine learning techniques to a variety of real-world problems, in particular, those arising from agricultural and horticultural domains. Unlike other machine learning projects, the emphasis is on providing a working environment for the domain specialist rather than the machine learning expert. Lessons learned include the necessity of providing a wealth of interactive tools for data manipulation, result visualization, database linkage, and cross-validation and comparison of rule sets, to complement the basic machine learning tools.

    View record details
  • Applying machine learning to agricultural data

    McQueen, Robert J.; Garner, Stephen R.; Nevill-Manning, Craig G.; Witten, Ian H. (1994-07)

    Working or discussion paper
    University of Waikato

    Many techniques have been developed for abstracting, or "learning," rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and error-prone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically well-founded, and perform well on more or less artificial test data sets, they stand or fall on their ability to make sense of real-world data. This paper describes a project that is applying a range of learning strategies to problems in primary industry, in particular agriculture and horticulture. We briefly survey some of the more readily applicable techniques that are emerging from the machine learning research community, describe a software workbench that allows users to experiment with a variety of techniques on real-world data sets, and detail the problems encountered and solutions developed in a case study of dairy herd management in which culling rules were inferred from a medium-sized database of herd information.

    View record details