182 results for Witten, Ian H.

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

  • Building a public digital library based on full-text retrieval

    Witten, Ian H.; Nevill-Manning, Craig G.; Cunningham, Sally Jo (1995-08)

    Working or discussion paper
    University of Waikato

    Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library, a reliable, high-quality community service, has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second obstacle is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation; these constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue is obtaining both plain text for indexing and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents, such as page-based indexing and viewing, without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing based on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes this problem. This paper describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface of a prototype public digital library.

  • Bi-level document image compression using layout information

    Inglis, Stuart J.; Witten, Ian H. (1996-01)

    Working or discussion paper
    University of Waikato

    Most bi-level images stored on computers today comprise scanned text, and their number is escalating because of the drive to archive large volumes of paper-based material electronically. These documents are stored using generic bi-level image technology, based either on classical run-length coding, such as the CCITT Group 4 method, or on modern schemes such as JBIG that predict pixels from their local image context. However, image compression methods that are tailored specifically for images known to contain printed text can provide noticeably superior performance because they effectively enlarge the context to the character level, at least for those predictions for which such a context is relevant. To deal effectively with general documents that contain text and pictures, it is necessary to detect layout and structural information from the image, and employ different compression techniques for different parts of the image. Such techniques are called document image compression methods.

  • Learning agents: from user study to implementation

    Maulsby, David; Witten, Ian H. (1996-04)

    Working or discussion paper
    University of Waikato

    Learning agents acquire procedures by being taught rather than programmed. To teach effectively, users prefer communicating in richer and more flexible ways than traditional computer dialogs allow. This paper describes the design, implementation and evaluation of a learning agent. In contrast to most Artificial Intelligence projects, the design centers on a user study, with a human-simulated agent, to discover the interactions that people find natural. Our work shows that users instinctively communicate via "hints": partially specified, ambiguous instructions. Hints may be input verbally, by pointing, or by selecting from menus. They may be unsolicited, or arise in response to a query from the agent. We develop a theory of instruction types that enables an agent to interpret them. The implementation demonstrates that computers can learn from examples and ambiguous hints. Finally, an evaluation reveals the extent to which our system meets the original design requirements.

  • Generating accurate rule sets without global optimization

    Frank, Eibe; Witten, Ian H. (1998-01)

    Working or discussion paper
    University of Waikato

    The two dominant schemes for rule learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees, and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as, and of similar size to, those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently and, because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.
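    The separate-and-conquer paradigm referred to in this abstract is easy to illustrate: learn one rule, remove the examples it covers, and repeat until no positive examples remain. The following sketch is a deliberately simplified stand-in, not the partial-decision-tree algorithm the paper presents; the toy weather data, attribute names, and greedy precision-based scoring are all illustrative assumptions.

```python
# Minimal separate-and-conquer rule learner (illustrative only; the paper's
# algorithm grows each rule from a partial decision tree, which this omits).

def grow_rule(examples, target):
    """Greedily add attribute=value tests until the covered set is pure."""
    conds, covered = {}, list(examples)
    while any(e["class"] != target for e in covered):
        best = None
        for attr in covered[0]:
            if attr == "class" or attr in conds:
                continue
            for val in {e[attr] for e in covered}:
                sub = [e for e in covered if e[attr] == val]
                pos = sum(1 for e in sub if e["class"] == target)
                score = (pos / len(sub), pos)   # precision, then coverage
                if best is None or score > best[0]:
                    best = (score, attr, val)
        if best is None:                        # no further test available
            break
        _, attr, val = best
        conds[attr] = val
        covered = [e for e in covered if e[attr] == val]
    return conds

def learn_rules(examples, target):
    """Learn rules one at a time, removing covered examples after each."""
    rules, remaining = [], list(examples)
    while any(e["class"] == target for e in remaining):
        rule = grow_rule(remaining, target)
        rules.append(rule)
        remaining = [e for e in remaining
                     if not all(e[a] == v for a, v in rule.items())]
    return rules

weather = [  # hypothetical toy data
    {"outlook": "sunny",    "windy": "no",  "class": "play"},
    {"outlook": "sunny",    "windy": "yes", "class": "stay"},
    {"outlook": "rainy",    "windy": "no",  "class": "stay"},
    {"outlook": "overcast", "windy": "yes", "class": "play"},
]
rules = learn_rules(weather, "play")
```

    On this toy data the learner emits one rule per pocket of the "play" class, such as a rule testing only outlook = overcast; each rule is grown on the examples left uncovered by its predecessors, which is exactly the "one rule at a time" regime the abstract contrasts with global optimization.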

  • Automating iterative tasks with programming by demonstration: a user evaluation

    Paynter, Gordon W.; Witten, Ian H. (1999-05)

    Working or discussion paper
    University of Waikato

    Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely-used, applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks. Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.

  • Understanding what machine learning produces - Part II: Knowledge visualization techniques

    Cunningham, Sally Jo; Humphrey, Matthew C.; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    Researchers in machine learning use decision trees, production rules, and decision graphs for visualizing classification data. Part I of this paper surveyed these representations, paying particular attention to their comprehensibility for non-specialist users. Part II turns attention to knowledge visualization—the graphic form in which a structure is portrayed and its strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. These knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.

  • Displaying 3D images: algorithms for single-image random-dot stereograms

    Thimbleby, Harold W.; Inglis, Stuart J.; Witten, Ian H. (1994-10-01)

    Journal article
    University of Waikato

    A new, simple, and symmetric algorithm can be implemented that results in higher levels of detail in solid objects than previously possible with autostereograms. In a stereoscope, an optical instrument similar to binoculars, each eye views a different picture and thereby receives the specific image that would have arisen naturally. An early suggestion for a color stereo computer display involved a rotating filter wheel held in front of the eyes. In contrast, this article describes a method for viewing on paper or on an ordinary computer screen without special equipment, although it is limited to the display of 3D monochromatic objects. (The image can be colored, say, for artistic reasons, but the method we describe does not allow colors to be allocated in a way that corresponds to an arbitrary coloring of the solid object depicted.) The image can easily be constructed by computer from any 3D scene or solid object description.
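    The linking idea behind such autostereograms can be sketched for a single scan line: each depth value fixes a separation, the pair of pixels that the two eyes should fuse into one dot is constrained to be equal, and the remaining pixels are coloured at random. This is a simplified sketch of that idea only; the published algorithm adds hidden-surface removal and geometrically derived separations, and the separation range used here is an arbitrary assumption.

```python
import random

def stereogram_row(depth, min_sep=40, max_sep=80):
    """One scan line of a single-image random-dot stereogram (simplified).
    depth[x] is in [0.0, 1.0], with 1.0 nearest the viewer. Pixels that
    must match are linked via same[], then colours are assigned randomly
    subject to those constraints."""
    width = len(depth)
    same = list(range(width))       # same[x] == x means "free to choose"
    for x in range(width):
        # nearer points get a smaller separation on the image plane
        sep = round(max_sep - depth[x] * (max_sep - min_sep))
        left = x - sep // 2
        right = left + sep
        if 0 <= left and right < width:
            # resolve earlier links so constraints chain to the right
            while same[left] != left:
                left = same[left]
            if left != right:
                same[min(left, right)] = max(left, right)
    pix = [0] * width
    for x in range(width - 1, -1, -1):   # links always point rightward
        pix[x] = pix[same[x]] if same[x] != x else random.randint(0, 1)
    return pix
```

    A flat depth map produces a row that repeats with the far-plane separation, the degenerate "wallpaper" pattern that autostereograms generalise by varying the separation with depth.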

  • Understanding what machine learning produces - Part I: Representations and their comprehensibility

    Cunningham, Sally Jo; Humphrey, Matthew C.; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    The aim of many machine learning users is to comprehend the structures that are inferred from a dataset, and such users may be far more interested in understanding the structure of their data than in predicting the outcome of new test data. Part I of this paper surveys representations based on decision trees, production rules and decision graphs that have been developed and used for machine learning. These representations have differing degrees of expressive power, and particular attention is paid to their comprehensibility for non-specialist users. The graphic form in which a structure is portrayed also has a strong effect on comprehensibility, and Part II of this paper develops knowledge visualization techniques that are particularly appropriate to help answer the questions that machine learning users typically ask about the structures produced.

  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

  • The MG retrieval system: compressing for space and speed

    Bell, Timothy C.; Moffat, Alistair; Witten, Ian H.; Zobel, Justin (1995)

    Journal article
    University of Waikato

    Recent advances in compression and indexing techniques have yielded a qualitative change in the feasibility of large-scale full-text retrieval.

  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.
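    The claim that full-text indexing carries no space penalty rests on coding both the text and the inverted index compactly. A standard ingredient in systems of this kind is to store each term's postings list as gaps between successive document numbers and to represent those gaps with a variable-length code; small gaps (frequent terms) then take very few bits. The sketch below uses the Elias gamma code over bit strings for clarity; real systems pack the bits and often use other codes, such as Golomb codes, chosen per list.

```python
def gamma_encode(n):
    """Elias gamma code for a positive integer n: (len-1) zero bits,
    then n in binary. E.g. 1 -> "1", 5 -> "00101"."""
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_postings(doc_ids):
    """Encode a sorted postings list as gamma-coded gaps between ids."""
    bits, prev = [], 0
    for d in doc_ids:
        bits.append(gamma_encode(d - prev))
        prev = d
    return "".join(bits)

def decode_postings(bits):
    """Invert encode_postings: read gamma codes, accumulate the gaps."""
    out, prev, i = [], 0, 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":   # count leading zeros = length prefix
            z += 1
            i += 1
        n = int(bits[i:i + z + 1], 2)
        i += z + 1
        prev += n
        out.append(prev)
    return out
```

    Because the code is prefix-free, postings lists can be concatenated and still decoded unambiguously, which is what makes the compressed index directly usable at query time.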

  • StoneD: A bridge between Greenstone and DSpace

    Witten, Ian H.; Bainbridge, David; Tansley, Robert; Huang, Chi-Yu; Don, Katherine J. (2005-04)

    Working or discussion paper
    University of Waikato

    Greenstone and DSpace are widely-used software systems for digital libraries, and prospective users sometimes wonder which one to adopt. In fact, the aims of the two are very different, although their domains of application do overlap. This paper describes the systems and identifies their similarities and differences. We also present StoneD, a software bridge between the production versions of Greenstone and DSpace that allows users of either system to migrate easily to the other, or to continue with a combination of both. This bridge eliminates the risk of finding oneself locked into an inappropriate choice of system. We also discuss other possible opportunities for combining the advantages of the two, to the benefit of the user communities of both systems.

  • Compression and explanation using hierarchical grammars

    Nevill-Manning, Craig G.; Witten, Ian H. (1996-07)

    Working or discussion paper
    University of Waikato

    Data compression is an eminently pragmatic pursuit: by removing redundancy, storage can be utilised more efficiently. Identifying redundancy also serves a less prosaic purpose: it provides cues for detecting structure, and the recognition of structure coincides with one of the goals of artificial intelligence, namely to make sense of the world by algorithmic means. This paper describes an algorithm that excels at both data compression and structural inference. The algorithm is implemented in a system called SEQUITUR that efficiently deals with sequences containing millions of symbols.
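    The hierarchy-building idea can be conveyed by a much simpler offline procedure than SEQUITUR itself: repeatedly replace the most frequent pair of adjacent symbols with a fresh grammar rule until no pair repeats. SEQUITUR works incrementally, maintaining digram-uniqueness and rule-utility constraints as each symbol arrives; the sketch below only illustrates the shared core idea, and its function names are invented for the example.

```python
from collections import Counter

def infer_grammar(seq):
    """Offline sketch: replace the most frequent adjacent pair with a new
    nonterminal until no pair occurs twice. Returns the compressed
    top-level sequence and the grammar rules."""
    seq, rules, next_id = list(seq), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or max(pairs.values()) < 2:
            return seq, rules
        pair = max(pairs, key=pairs.get)        # most frequent digram
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):                     # greedy left-to-right rewrite
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

def expand(symbol, rules):
    """Undo the grammar: expand a symbol back to its terminal sequence."""
    if symbol in rules:
        a, b = rules[symbol]
        return expand(a, rules) + expand(b, rules)
    return [symbol]
```

    Run on "abcabcabc", this produces a small rule hierarchy whose expansion reproduces the input exactly; the rules themselves expose the repeated structure, which is the sense in which compression doubles as structural inference.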

  • An MDL estimate of the significance of rules

    Cleary, John G.; Legg, Shane; Witten, Ian H. (1996-03)

    Working or discussion paper
    University of Waikato

    This paper proposes a new method for measuring the performance of models, whether decision trees or sets of rules, inferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexity of test data with respect to the model. It has been evaluated on rule sets produced by several different machine learning schemes on a large number of standard data sets. When compared with the usual percentage-correct measure, it is shown to agree with it in restricted cases. However, in other, more general cases taken from real data sets, for example when rule sets make multiple or no predictions, it disagrees substantially. It is argued that the MDL measure is more reasonable in these cases and represents a better way of assessing the significance of a rule set's performance. The question of the complexity of the rule set itself is not addressed in the paper.

  • Induction of model trees for predicting continuous classes

    Wang, Yong; Witten, Ian H. (1996-10)

    Working or discussion paper
    University of Waikato

    Many problems encountered when applying machine learning in practice involve predicting a "class" that takes on a continuous numeric value, yet few machine learning schemes are able to do this. This paper describes a "rational reconstruction" of M5, a method developed by Quinlan (1992) for inducing trees of regression models. In order to accommodate data typically encountered in practice it is necessary to deal effectively with enumerated attributes and with missing values, and techniques devised by Breiman et al. (1984) are adapted for this purpose. The resulting system seems to outperform M5, based on the scanty published data that is available.

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

  • Selecting multiway splits in decision trees

    Frank, Eibe; Witten, Ian H. (1996-12)

    Working or discussion paper
    University of Waikato

    Decision trees in which numeric attributes are split several ways are more comprehensible than the usual binary trees because attributes rarely appear more than once in any path from root to leaf. There are efficient algorithms for finding the optimal multiway split for a numeric attribute, given the number of intervals in which it is to be divided. The problem we tackle is how to choose this number in order to obtain small, accurate trees.
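    Given the number of intervals, an optimal multiway split over a numeric attribute can be found by dynamic programming on the examples sorted by that attribute's value. The sketch below is an illustration under assumptions of its own: it uses misclassification count as the impurity measure, whereas the paper's algorithms and splitting criterion may differ.

```python
def best_multiway_split(labels, k):
    """Partition a sequence of class labels (already sorted by the numeric
    attribute) into k contiguous intervals minimising total misclassification.
    dp[i][j] = best cost of splitting the first i items into j intervals."""
    n = len(labels)

    def cost(i, j):  # non-majority count for items i..j-1 as one interval
        seg = labels[i:j]
        return len(seg) - max(seg.count(c) for c in set(seg))

    INF = float("inf")
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(j, n + 1):          # need at least j items for j intervals
            for m in range(j - 1, i):      # m = start of the last interval
                c = dp[m][j - 1] + cost(m, i)
                if c < dp[i][j]:
                    dp[i][j], cut[i][j] = c, m
    bounds, i = [], n
    for j in range(k, 0, -1):              # walk back through the cut points
        i = cut[i][j]
        bounds.append(i)
    return dp[n][k], sorted(bounds)[1:]    # interior boundaries only
```

    The run time is cubic in this naive form; with incremental class counts the inner cost evaluation becomes constant time, which is the kind of efficiency the abstract's "efficient algorithms" refers to in spirit, though the actual method may be organised differently.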
