77 results for Witten, Ian H., Conference item

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

  • Content-Based Language Learning in a Digital Library

    Wu, Shaoqun; Witten, Ian H. (2007)

    Conference item
    University of Waikato

    Digital libraries have untapped potential for supporting language teaching and learning. This paper describes a new scheme for automating topic-specific language learning using a specially built digital library. Three exercises of different types are generated automatically from the library content: one that learners undertake individually, one in which learners collaborate in pairs, and one in which a group of learners compete. The system aims to foster content-based language learning, which greatly increases students’ motivation, fosters long-term recollection, and can be culturally situated in appropriate ways.

  • Creating and customizing digital library collections with the Greenstone Librarian Interface

    Witten, Ian H. (2004)

    Conference item
    University of Waikato

    The Greenstone digital library software is a comprehensive system for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet. This paper describes how digital library collections can be created and customized with the new Greenstone Librarian Interface. Its basic features allow users to add documents and metadata to collections, create new collections whose structure mirrors existing ones, and build collections and put them in place for users to view. More advanced users can design and customize new collection structures. At the most advanced level, the Librarian Interface gives expert users interactive access to the full power of Greenstone, which could formerly be tapped only by running Perl scripts manually.

  • Practical digital library interoperability standards

    Bainbridge, David; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    As the field of digital libraries matures and new systems and standards develop, the ability to interoperate between systems becomes paramount. This tutorial gives a practical introduction to many recent standards and de facto standards for interoperability, and illustrates them using open source digital library software, including online demonstrations of interoperation issues and solutions. Core standards that are discussed include Dublin Core, OAI-PMH, METS, and MODS. We use interoperation between Greenstone and DSpace as a motivating case study. For those demonstrations that involve Greenstone, attendees who wish to may bring their laptops, install Greenstone from a CD-ROM that we will provide, along with various sample files, and follow along with the demonstrations on their own machine.

  • A new framework for building digital library collections

    Buchanan, George; Bainbridge, David; Don, Katherine J.; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    This paper introduces a new framework for building digital library collections and contrasts it with existing systems. It describes a significant new step in the development of a widely-used open-source digital library system, Greenstone, which has evolved over many years. It is supported by a fresh implementation, which forced us to rethink the entire design rather than making incremental improvements. The redesign capitalizes on the best ideas from the existing system, which have been refined and developed to open new avenues through which digital librarians can tailor their collections. We demonstrate its flexibility by showing how digital library collections can be extended and altered to satisfy new requirements.

  • Building digital library collections with Greenstone

    Witten, Ian H.; Bainbridge, David (2005)

    Conference item
    University of Waikato

    This tutorial will demonstrate how to build a variety of different kinds of digital library collections with the Greenstone digital library software, a comprehensive, open-source system for constructing, presenting, and maintaining information collections. Collections will be built from HTML documents; Word, PDF and PostScript documents; images in various formats; MP3 and MIDI audio; MARC records; and more. For each collection, various different full-text search indexes and metadata-based browsers will be created. Attendees who wish to are encouraged to bring their laptops, install Greenstone from a CD-ROM that we will provide, along with various sample files, and follow along with the demonstrations on their own machine.

  • Extending Greenstone for Institutional Repositories

    Bainbridge, David; Osborn, Wendy; Witten, Ian H.; Nichols, David M. (2006)

    Conference item
    University of Waikato

    We examine the problem of designing a generalized system for building institutional repositories. Widely used schemes such as DSpace are tailored to a particular set of requirements: fixed metadata set; standard view when searching and browsing; pre-determined sequence for depositing items; built-in workflow for vetting new items. In contrast, Fedora builds in flexibility: institutional repositories are just one possible instantiation—however generality incurs a high overhead and uptake has been sluggish. This paper shows how existing components of the Greenstone software can be repurposed to provide a generalized institutional repository that falls between these extremes.

  • Managing change in a digital library system with many interface languages

    Bainbridge, David; Edgar, Katrina D.; Witten, Ian H.; McPherson, John R. (2003)

    Conference item
    University of Waikato

    Managing the organizational and software complexity of a comprehensive open source digital library system presents a significant challenge. The challenge becomes even more imposing when the interface is available in different languages, for enhancements to the software and changes to the interface must be faithfully reflected in each language version. This paper describes the solution adopted by Greenstone, a multilingual digital library system distributed by UNESCO in a trilingual European version (English, French, Spanish), complete with all documentation, and whose interface is available in many further languages. Greenstone incorporates a language translation facility which allows authorized people to update the interface in specified languages. A standard version control system is used to manage software change, and from this the system automatically determines which language fragments need updating and presents them to the human translator.

  • Assembling and enriching digital library collections

    Bainbridge, David; Thompson, John; Witten, Ian H. (2003)

    Conference item
    University of Waikato

    People who create digital libraries need to gather together the raw material, add metadata as necessary, and design and build new collections. This paper sets out the requirements for these tasks and describes a new tool that supports them interactively, making it easy for users to create their own collections from electronic files of all types. The process involves selecting documents for inclusion, coming up with a suitable metadata set, assigning metadata to each document or group of documents, designing the form of the collection in terms of document formats, searchable indexes, and browsing facilities, building the necessary indexes and data structures, and putting the collection in place for others to use. Moreover, different situations require different workflows, and the system must be flexible enough to cope with these demands. Although the tool is specific to the Greenstone digital library software, the underlying ideas should prove useful in more general contexts.

  • How to turn the page

    Chu, Yi-Chun; Witten, Ian H.; Lobb, Richard; Bainbridge, David (2003)

    Conference item
    University of Waikato

    Can digital libraries provide a reading experience that more closely resembles a real book than a scrolled or paginated electronic display? This paper describes a prototype page-turning system that realistically animates full three-dimensional page-turns. The dynamic behavior is generated by a mass-spring model defined on a rectangular grid of particles. The prototype takes a PDF or E-book file, renders it into a sequence of PNG images representing individual pages, and animates the page-turns under user control. The simulation behaves fairly naturally, although more computer graphics work is required to perfect it.

  • A user evaluation of hierarchical phrase browsing

    Edgar, Katrina D.; Nichols, David M.; Paynter, Gordon W.; Thomson, Kirsten; Witten, Ian H. (2003)

    Conference item
    University of Waikato

    Phrase browsing interfaces based on hierarchies of phrases extracted automatically from document collections offer a useful compromise between automatic full-text searching and manually-created subject indexes. The literature contains descriptions of such systems that many find compelling and persuasive. However, evaluation studies have either been anecdotal, or focused on objective measures of the quality of automatically-extracted index terms, or restricted to questions of computational efficiency and feasibility. This paper reports on an empirical, controlled user study that compares hierarchical phrase browsing with full-text searching over a range of information seeking tasks. Users found the results located via phrase browsing to be relevant and useful but preferred keyword searching for certain types of queries. Users' experiences were marred by interface details, including inconsistencies between the phrase browser and the surrounding digital library interface.

  • Token identification using HMM and PPM models

    Wen, Yingying; Witten, Ian H.; Wang, Dianhui (2003)

    Conference item
    University of Waikato

    Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been successfully used in language processing tasks including learning-based token identification. Most existing systems are domain- and language-dependent, which limits their retargetability and applicability. This paper investigates the effect of combining HMMs and PPM on token identification. We implement a system that bridges the two well-known methods through words new to the identification model. The system is fully domain- and language-independent: no code changes are necessary when applying it to other domains or languages, and the only required input is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 69.02% for TCC and 76.59% for BIB. Although the performance is not as good as that obtained from a system with language-dependent components, our proposed system can deal with a wide range of domain- and language-independent problems. Identification of dates gives the best result: 73% and 92% of correct tokens are identified for the two corpora respectively. The system also performs reasonably well on people's names, with 68% of correct tokens identified for TCC and 76% for BIB.

  • Learning language using genetic algorithms

    Smith, Tony C.; Witten, Ian H. (1996)

    Conference item
    University of Waikato

    Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong. This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language.

  • Using a permutation test for attribute selection in decision trees

    Frank, Eibe; Witten, Ian H. (1998)

    Conference item
    University of Waikato

    Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case, and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favourably with other methods that also adopt a pre-pruning strategy.

  • A Fedora librarian interface

    Bainbridge, David; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    The Fedora content management system embodies a powerful and flexible digital object model. This paper describes a new open-source software front-end that enables end-user librarians to transfer documents and metadata in a variety of formats into a Fedora repository. The main graphical facility that Fedora itself provides for this task operates on one document at a time and is not librarian-friendly. A batch-driven alternative is possible, but requires documents to be converted beforehand into the XML format used by the repository, which calls for programming skills. In contrast, our new scheme allows arbitrary collections of documents residing on the user's computer (or the web at large) to be ingested into a Fedora repository in one operation, without a need for programming expertise. Provision is also made for editing existing documents and metadata, and adding new ones. The documents can be in a wide variety of formats, and the user interface is suitable for practicing librarians. The design capitalizes on our experience in building the Greenstone librarian interface and participating in dozens of workshops with librarians worldwide.
