77 results for Witten, Ian H., Conference item

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

  • Using a permutation test for attribute selection in decision trees

    Frank, Eibe; Witten, Ian H. (1998)

    Conference item
    University of Waikato

    Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case, and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favourably with other methods that also adopt a pre-pruning strategy.

  • Experiences with the Greenstone digital library software for international development

    Nichols, David M.; Rose, John; Bainbridge, David; Witten, Ian H. (2010)

    Conference item
    University of Waikato

    Greenstone is a versatile open source multilingual digital library environment, emerging from research on text compression within the New Zealand Digital Library Research Project in the Department of Computer Science at the University of Waikato. In 1997 we began to work with Human Info NGO to help them produce fully-searchable CD-ROM collections of humanitarian information. The software has since evolved to support a variety of application contexts. Rather than being simply a delivery mechanism, we have emphasised the empowerment of users to create and distribute their own digital collections.

  • Stress-testing general purpose digital library software

    Bainbridge, David; Witten, Ian H.; Boddie, Stefan J.; Thompson, John (2009)

    Conference item
    University of Waikato

    DSpace, Fedora, and Greenstone are three widely used open source digital library systems. In this paper we report on scalability tests performed on these tools by ourselves and others. These range from repositories populated with synthetically produced data to real world deployment with content measured in millions of items. A case study is presented that details how one of the systems performed when used to produce fully-searchable newspaper collections containing in excess of 20 GB of raw text (2 billion words, with 60 million unique terms), 50 GB of metadata, and 570 GB of images.

  • Content-Based Language Learning in a Digital Library

    Wu, Shaoqun; Witten, Ian H. (2007)

    Conference item
    University of Waikato

    Digital libraries have untapped potential for supporting language teaching and learning. This paper describes a new scheme for automating topic-specific language learning using a specially built digital library. Three exercises of different types are generated automatically from the library content: one that learners undertake individually, one in which learners collaborate in pairs, and one in which a group of learners compete. The system aims to foster content-based language learning, which greatly increases students’ motivation, fosters long-term recollection, and can be culturally situated in appropriate ways.

  • Practical digital library interoperability standards

    Bainbridge, David; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    As the field of digital libraries matures and new systems and standards develop, the ability to interoperate between systems becomes paramount. This tutorial gives a practical introduction to many recent standards and de facto standards for interoperability, and illustrates them using open source digital library software, including online demonstrations of interoperation issues and solutions. Core standards that are discussed include Dublin Core, OAI-PMH, METS, and MODS. We use interoperation between Greenstone and DSpace as a motivating case study. For those demonstrations that involve Greenstone, attendees who wish to may bring their laptops, install Greenstone from a CD-ROM that we will provide, along with various sample files, and follow along with the demonstrations on their own machine.

  • Building digital library collections with Greenstone

    Witten, Ian H.; Bainbridge, David (2005)

    Conference item
    University of Waikato

    This tutorial will demonstrate how to build a variety of different kinds of digital library collections with the Greenstone digital library software, a comprehensive, open-source system for constructing, presenting, and maintaining information collections. Collections will be built from HTML documents; Word, PDF and PostScript documents; images in various formats; MP3 and MIDI audio; MARC records; and more. For each collection, various different full-text search indexes and metadata-based browsers will be created. Attendees who wish to are encouraged to bring their laptops, install Greenstone from a CD-ROM that we will provide, along with various sample files, and follow along with the demonstrations on their own machine.

  • Assembling and enriching digital library collections

    Bainbridge, David; Thompson, John; Witten, Ian H. (2003)

    Conference item
    University of Waikato

    People who create digital libraries need to gather together the raw material, add metadata as necessary, and design and build new collections. This paper sets out the requirements for these tasks and describes a new tool that supports them interactively, making it easy for users to create their own collections from electronic files of all types. The process involves selecting documents for inclusion, coming up with a suitable metadata set, assigning metadata to each document or group of documents, designing the form of the collection in terms of document formats, searchable indexes, and browsing facilities, building the necessary indexes and data structures, and putting the collection in place for others to use. Moreover, different situations require different workflows, and the system must be flexible enough to cope with these demands. Although the tool is specific to the Greenstone digital library software, the underlying ideas should prove useful in more general contexts.

  • Document level interoperability for Collection Creators

    Bainbridge, David; Ke, Kaun-Yu; Witten, Ian H. (2006)

    Conference item
    University of Waikato

    Digital library interoperability for both documents and metadata is a critical and complex issue. Although many relevant standards have been developed, and continue to evolve, in practice things are not quite so easy as they seem. We have built a software environment called the Exchange Center that helps digital librarians manage the process of sourcing documents and metadata from various repositories, adding local content where necessary, and exporting the resulting collection into formats that are suitable for digital library repositories. This paper describes the software, which is built on Greenstone but does not require its use as the final digital library server.

  • A retrospective look at Greenstone: Lessons from the first decade

    Witten, Ian H.; Bainbridge, David (2007)

    Conference item
    University of Waikato

    The Greenstone Digital Library Software has helped spread the practical impact of digital library technology throughout the world, with particular emphasis on developing countries. As Greenstone enters its second decade, this article takes a retrospective look at its development, the challenges that have been faced, and the lessons that have been learned in developing and deploying a comprehensive open-source system for the construction of digital libraries internationally. Not surprisingly, the most difficult challenges have been political, educational, and sociological, echoing that old programmers' blessing "may all your problems be technical ones."

  • A competitive environment for exploratory query expansion

    Milne, David N.; Nichols, David M.; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    Most information workers query digital libraries many times a day. Yet people have little opportunity to hone their skills in a controlled environment, or compare their performance with others in an objective way. Conversely, although search engine logs record how users evolve queries, they lack crucial information about the user's intent. This paper describes an environment for exploratory query expansion that pits users against each other and lets them compete, and practice, in their own time and on their own workstation. The system captures query evolution behavior on predetermined information-seeking tasks. It is publicly available, and the code is open source so that others can set up their own competitive environments.

  • Modeling for optimal probability prediction

    Wang, Yong; Witten, Ian H. (2002)

    Conference item
    University of Waikato

    We present a general modelling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural by-product. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite. As a case study, we investigate the problem of fitting logistic models in finite-sample situations. Simulation results on both artificial and practical datasets are supportive.

  • A knowledge-based search engine powered by Wikipedia

    Milne, David N.; Witten, Ian H.; Nichols, David M. (2007)

    Conference item
    University of Waikato

    This paper describes Koru, a new search interface that offers effective domain-independent knowledge-based information retrieval. Koru exhibits an understanding of the topics of both queries and documents. This allows it to (a) expand queries automatically and (b) help guide the user as they evolve their queries interactively. Its understanding is mined from the vast investment of manual effort and judgment that is Wikipedia. We show how this open, constantly evolving encyclopedia can yield inexpensive knowledge structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We conducted a detailed user study with 12 participants and 10 topics from the 2005 TREC HARD track, and found that Koru and its underlying knowledge base offer significant advantages over traditional keyword search. It was capable of lending assistance to almost every query issued to it, making query entry more efficient, improving the relevance of the documents returned, and narrowing the gap between expert and novice seekers.

  • Perambulating libraries: Demonstrating how a Victorian idea can help OLPC users share books

    Witten, Ian H.; Bainbridge, David (2011)

    Conference item
    University of Waikato

    In this extended abstract we detail how the open source digital library toolkit Greenstone [5] can help users of the XO-laptop, produced by the One Laptop Per Child Foundation, manage and share electronic documents. The idea draws upon mobile libraries (bookmobiles) for its inspiration, which first appeared in Victorian times. The implemented technique works by building on the Mesh network that is instrumental to the XO-laptop approach. To use the technique, on each portable XO-laptop a version of Greenstone is installed, allowing the owner to develop and manage their own set of books. The version of Greenstone has been adapted to support a form of interoperability we have called Digital Library Talkback. On the Mesh, when two XO-laptops “see” each other, the two users can search and browse the other user’s digital library; when they see a book they like, they can have it transferred to their library with a single click using the Digital Library Talkback mechanism.

  • A link-based visual search engine for Wikipedia

    Milne, David N.; Witten, Ian H. (2011)

    Conference item
    University of Waikato

    This paper introduces Hopara, a new search engine that aims to make Wikipedia easier to explore. It works on top of the encyclopedia's existing link structure, abstracting away from document content and allowing users to navigate the resource at a higher level. It utilizes semantic relatedness measures to emphasize articles and connections that are most likely to be of interest, visualization to expose the structure of how the available information is organized, and lightweight information extraction to explain itself.
