56 results for Witten, Ian H., Working or discussion paper

  • Building a public digital library based on full-text retrieval

    Witten, Ian H.; Nevill-Manning, Craig G.; Cunningham, Sally Jo (1995-08)

    Working or discussion paper
    University of Waikato

    Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library (a reliable, quality, community service) has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second obstacle is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation. These constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue becomes the question of obtaining both plain text for indexing, and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents (such as page-based indexing and viewing) without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes the problem. This paper describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface, of a prototype public digital library.

  • Bi-level document image compression using layout information

    Inglis, Stuart J.; Witten, Ian H. (1996-01)

    Working or discussion paper
    University of Waikato

    Most bi-level images stored on computers today comprise scanned text, and their number is escalating because of the drive to archive large volumes of paper-based material electronically. These documents are stored using generic bi-level image technology, based either on classical run-length coding, such as the CCITT Group 4 method, or on modern schemes such as JBIG that predict pixels from their local image context. However, image compression methods that are tailored specifically for images known to contain printed text can provide noticeably superior performance because they effectively enlarge the context to the character level, at least for those predictions for which such a context is relevant. To deal effectively with general documents that contain text and pictures, it is necessary to detect layout and structural information from the image, and employ different compression techniques for different parts of the image. Such techniques are called document image compression methods.

  • Learning agents: from user study to implementation

    Maulsby, David; Witten, Ian H. (1996-04)

    Working or discussion paper
    University of Waikato

    Learning agents acquire procedures by being taught rather than programmed. To teach effectively, users prefer communicating in richer and more flexible ways than traditional computer dialogs allow. This paper describes the design, implementation and evaluation of a learning agent. In contrast to most Artificial Intelligence projects, the design centers on a user study, with a human-simulated agent to discover the interactions that people find natural. Our work shows that users instinctively communicate via "hints," or partially-specified, ambiguous instructions. Hints may be input verbally, or by pointing, or by selecting from menus. They may be unsolicited, or arise in response to a query from the agent. We develop a theory of instruction types for an agent to interpret them. The implementation demonstrates that computers can learn from examples and ambiguous hints. Finally, an evaluation reveals the extent to which our system meets the original design requirements.

  • Generating accurate rule sets without global optimization

    Frank, Eibe; Witten, Ian H. (1998-01)

    Working or discussion paper
    University of Waikato

    The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees, and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER’s. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.

  • Automating iterative tasks with programming by demonstration: a user evaluation

    Paynter, Gordon W.; Witten, Ian H. (1999-05)

    Working or discussion paper
    University of Waikato

    Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely-used, applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks. Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.

  • Understanding what machine learning produces - Part II: Knowledge visualization techniques

    Cunningham, Sally Jo; Humphrey, Matthew C.; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    Researchers in machine learning use decision trees, production rules, and decision graphs for visualizing classification data. Part I of this paper surveyed these representations, paying particular attention to their comprehensibility for non-specialist users. Part II turns attention to knowledge visualization—the graphic form in which a structure is portrayed and its strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. These knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.

  • Understanding what machine learning produces - Part I: Representations and their comprehensibility

    Cunningham, Sally Jo; Humphrey, Matthew C.; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    The aim of many machine learning users is to comprehend the structures that are inferred from a dataset, and such users may be far more interested in understanding the structure of their data than in predicting the outcome of new test data. Part I of this paper surveys representations based on decision trees, production rules and decision graphs that have been developed and used for machine learning. These representations have differing degrees of expressive power, and particular attention is paid to their comprehensibility for non-specialist users. The graphic form in which a structure is portrayed also has a strong effect on comprehensibility, and Part II of this paper develops knowledge visualization techniques that are particularly appropriate to help answer the questions that machine learning users typically ask about the structures produced.

  • StoneD: A bridge between Greenstone and DSpace

    Witten, Ian H.; Bainbridge, David; Tansley, Robert; Huang, Chi-Yu; Don, Katherine J. (2005-04)

    Working or discussion paper
    University of Waikato

    Greenstone and DSpace are widely-used software systems for digital libraries, and prospective users sometimes wonder which one to adopt. In fact, the aims of the two are very different, although their domains of application do overlap. This paper describes the systems and identifies their similarities and differences. We also present StoneD, a stone bridge between the production versions of Greenstone and DSpace that allows users of either system to easily migrate to the other, or continue with a combination of both. This bridge eliminates the risk of finding oneself locked in to an inappropriate choice of system. We also discuss other possible opportunities for combining the advantages of the two, to the benefit of the user communities of both systems.

  • Compression and explanation using hierarchical grammars

    Nevill-Manning, Craig G.; Witten, Ian H. (1996-07)

    Working or discussion paper
    University of Waikato

    Data compression is an eminently pragmatic pursuit: by removing redundancy, storage can be utilised more efficiently. Identifying redundancy also serves a less prosaic purpose: it provides cues for detecting structure, and the recognition of structure coincides with one of the goals of artificial intelligence, which is to make sense of the world by algorithmic means. This paper describes an algorithm that excels at both data compression and structural inference. This algorithm is implemented in a system called SEQUITUR that efficiently deals with sequences containing millions of symbols.

  • An MDL estimate of the significance of rules

    Cleary, John G.; Legg, Shane; Witten, Ian H. (1996-03)

    Working or discussion paper
    University of Waikato

    This paper proposes a new method for measuring the performance of models (whether decision trees or sets of rules) inferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexity of test data with respect to the model. It has been evaluated on rule sets produced by several different machine learning schemes on a large number of standard data sets. When compared with the usual percentage correct measure, it is shown to agree with it in restricted cases. However, in other more general cases taken from real data sets (for example, when rule sets make multiple or no predictions) it disagrees substantially. It is argued that the MDL measure is more reasonable in these cases and represents a better way of assessing the significance of a rule set's performance. The question of the complexity of the rule set itself is not addressed in the paper.

  • Induction of model trees for predicting continuous classes

    Wang, Yong; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    Many problems encountered when applying machine learning in practice involve predicting a "class" that takes on a continuous numeric value, yet few machine learning schemes are able to do this. This paper describes a "rational reconstruction" of M5, a method developed by Quinlan (1992) for inducing trees of regression models. In order to accommodate data typically encountered in practice it is necessary to deal effectively with enumerated attributes and with missing values, and techniques devised by Breiman et al. (1984) are adapted for this purpose. The resulting system seems to outperform M5, based on the scanty published data that is available.

  • Selecting multiway splits in decision trees

    Frank, Eibe; Witten, Ian H. (1996-12)

    Working or discussion paper
    University of Waikato

    Decision trees in which numeric attributes are split several ways are more comprehensible than the usual binary trees because attributes rarely appear more than once in any path from root to leaf. There are efficient algorithms for finding the optimal multiway split for a numeric attribute, given the number of intervals in which it is to be divided. The problem we tackle is how to choose this number in order to obtain small, accurate trees.

  • Weka: Practical machine learning tools and techniques with Java implementations

    Witten, Ian H.; Frank, Eibe; Trigg, Leonard E.; Hall, Mark A.; Holmes, Geoffrey; Cunningham, Sally Jo (1999-08)

    Working or discussion paper
    University of Waikato

    The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

  • Interactive concept learning for end-user applications

    Maulsby, David; Witten, Ian H. (1995-02)

    Working or discussion paper
    University of Waikato

    Personalizable software agents will learn new tasks from their users. This implies being able to learn from instructions users might give: examples, yes/no responses, and ambiguous, incomplete hints. Agents should also exploit background knowledge customized for applications such as drawing, word processing and form-filling. The task models that agents learn describe data, actions and their context. Learning about data from examples and hints is the subject of this paper. The Cima learning system combines evidence from examples, task knowledge and user hints to form Disjunctive Normal Form (DNF) rules for classifying, generating or modifying data. Cima's dynamic bias manager generates candidate features (attribute values, functions or relations), from which its DNF learning algorithm selects relevant features and forms the rules. The algorithm is based on a classic greedy method, with two enhancements. First, the standard learning criterion, correct classification, is augmented with a set of utility and instructional criteria. Utility criteria ensure that descriptions are properly formed for use in actions, whether to classify, search for, generate or modify data. Instructional criteria ensure that descriptions include features that users suggest and avoid those that users reject. The second enhancement is to augment the usual statistical metric for selecting relevant attributes with a set of heuristics, including beliefs based on user suggestions and application-specific background knowledge. Using multiple heuristics increases the justification for selecting features; more important, it helps the learner choose among alternative interpretations of hints. When tested on dialogues observed in a prior user study on a simulated interface agent, the learning algorithm achieves 95% of the learning efficiency standard established in that study.

  • Stacking bagged and dagged models

    Ting, Kai Ming; Witten, Ian H. (1997-03)

    Working or discussion paper
    University of Waikato

    In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context: when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking. We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than conventional bagging.

  • A New Zealand digital library for computer science research

    Witten, Ian H.; Cunningham, Sally Jo; Vallabh, Mahendra; Bell, Timothy C. (1995-03)

    Working or discussion paper
    University of Waikato

    A large amount of computing literature has become available over the Internet, as university departments and research institutions have made their technical reports, preprints, and theses available electronically. Access to these items has been limited, however, by the difficulties involved in locating documents of interest. We describe a proposal for a New Zealand-based index of computer science technical reports, where the reports themselves are located in repositories that are distributed world-wide. Our scheme is unique in that it is based on indexing the full text of the technical reports, rather than on document surrogates. The index is constructed so as to minimize network traffic and local storage costs (of particular importance for geographically isolated countries like New Zealand, which incur high Internet costs). We will also provide support for bibliometric/scientometric studies of the computing literature and our users.

  • Using model trees for classification

    Frank, Eibe; Wang, Yong; Inglis, Stuart J.; Holmes, Geoffrey; Witten, Ian H. (1997-04)

    Working or discussion paper
    University of Waikato

    Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5’, based on Quinlan’s M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

  • Signal processing for melody transcription

    McNab, Rodger J.; Smith, Lloyd A.; Witten, Ian H. (1995-08)

    Working or discussion paper
    University of Waikato

    MT is a melody transcription system that accepts acoustic input, typically sung by the user, and displays it in standard music notation. It tracks the pitch of the input and segments the pitch stream into musical notes, which are labelled by their pitches relative to a reference frequency that adapts to the user's tuning. This paper describes the signal processing operations involved, and discusses two applications that have been prototyped: a sightsinging tutor and a scheme for acoustically indexing a melody database.

  • Applying machine learning to agricultural data

    McQueen, Robert J.; Garner, Stephen R.; Nevill-Manning, Craig G.; Witten, Ian H. (1994-07)

    Working or discussion paper
    University of Waikato

    Many techniques have been developed for abstracting, or "learning," rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and error-prone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically well-founded, and perform well on more or less artificial test data sets, they stand or fall on their ability to make sense of real-world data. This paper describes a project that is applying a range of learning strategies to problems in primary industry, in particular agriculture and horticulture. We briefly survey some of the more readily applicable techniques that are emerging from the machine learning research community, describe a software workbench that allows users to experiment with a variety of techniques on real-world data sets, and detail the problems encountered and solutions developed in a case study of dairy herd management in which culling rules were inferred from a medium-sized database of herd information.

  • Data transformation: a semantically-based approach to function discovery

    Phan, Thong H.; Witten, Ian H. (1994-08)

    Working or discussion paper
    University of Waikato

    This paper presents the method of data transformation for discovering numeric functions from their examples. Based on the idea of transformations between functions, this method can be viewed as a semantic counterpart to the more common approach of formula construction used in most previous discovery systems. Advantages of the new method include a flexible implementation through the design of transformation rules, and a sound basis for rigorous mathematical analysis to characterize what can be discovered. The method has been implemented in a discovery system called "LINUS," which can identify a wide range of functions: rational functions, quadratic relations, and many transcendental functions, as well as those that can be transformed to rational functions by combinations of differentiation, logarithm and function inverse operations.
