77 results for Witten, Ian H., Conference item

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

    View record details
  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

    View record details
  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

    View record details
  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided. (A minimal illustrative sketch of the phrase-hierarchy idea appears after this record.)

    View record details
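
    As a companion to the SEQUITUR abstract above, here is a minimal Python sketch of the phrase-hierarchy idea: repeatedly replace the most frequent adjacent pair of symbols with a new rule. This offline digram-replacement loop is an illustration only (closer in spirit to Re-Pair than to SEQUITUR's incremental algorithm), and the rule names and example string are invented.

        # Offline digram replacement: rewrite the most frequent adjacent pair as a
        # new non-terminal until no pair repeats, building a rule hierarchy.
        # Illustrative only; SEQUITUR itself works incrementally on a growing sequence.
        from collections import Counter

        def induce_rules(sequence):
            seq = list(sequence)
            rules = {}                                 # non-terminal -> pair it expands to
            next_id = 0
            while True:
                pairs = Counter(zip(seq, seq[1:]))
                if not pairs:
                    break
                pair, count = pairs.most_common(1)[0]
                if count < 2:                          # no repeated digram remains
                    break
                nt = f"R{next_id}"
                next_id += 1
                rules[nt] = pair
                new_seq, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                        new_seq.append(nt)             # replace the pair with the rule
                        i += 2
                    else:
                        new_seq.append(seq[i])
                        i += 1
                seq = new_seq
            return seq, rules

        top, rules = induce_rules("abcabcabc")
        print(top)      # the compressed top-level sequence
        print(rules)    # the hierarchy of rules it refers to
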
  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

    View record details
  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

    View record details
  • Determining progression in glaucoma using visual fields

    Turpin, Andrew; Frank, Eibe; Hall, Mark A.; Witten, Ian H.; Johnson, Chris A. (2001)

    Conference item
    University of Waikato

    The standardized visual field assessment, which measures visual function in 76 locations of the central visual area, is an important diagnostic tool in the treatment of the eye disease glaucoma. It helps determine whether the disease is stable or progressing towards blindness, with important implications for treatment. Automatic techniques to classify patients based on this assessment have had limited success, primarily due to the high variability of individual visual field measurements. The purpose of this paper is to describe the problem of visual field classification to the data mining community, and assess the success of data mining techniques on it. Preliminary results show that machine learning methods rival existing techniques for predicting whether glaucoma is progressing—though we have not yet been able to demonstrate improvements that are statistically significant. It is likely that further improvement is possible, and we encourage others to work on this important practical data mining problem.

    View record details
  • A bookmaker's workbench

    Liesaputra, Veronica; Witten, Ian H. (2011)

    Conference item
    University of Waikato

    We have been developing electronic Realistic Books that combine the natural advantages of electronic documents---full-text search, hyperlinks, animation, multimedia---with those of conventional books---the ambient information provided by the physical object, analog page turning, random-access navigation, bookmarks, highlighting and annotation. Although simple Realistic Books can easily be created from PDF or HTML files using a shell script or web service, it is not so easy for book designers to take advantage of advanced features that are not normally represented in the input files. This paper describes the Bookmaker's Workbench, an interactive system intended to help book designers produce Realistic Books. It incorporates many features, including a text mining option that automatically identifies significant key terms and marks them visually in the text, the ability to incorporate synonyms automatically into the full-text search capability, and the ability to include an automatically generated back-of-the-book index. A user evaluation is reported that demonstrates the system's usability and learnability.

    View record details
  • Learning to link with Wikipedia

    Milne, David N.; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter. (A minimal sketch of link-based relatedness appears after this record.)

    View record details
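
    As an illustration of computing relatedness from hyperlink structure rather than text, here is a minimal sketch that scores two articles by the normalised overlap of their incoming-link sets (a Google-distance-style formulation, assumed here for illustration rather than taken from the paper). The article names, link sets, and wiki size are invented placeholders.

        # Link-based semantic relatedness: two articles are related if many other
        # articles link to both of them. All data below are invented placeholders.
        import math

        def link_relatedness(inlinks_a, inlinks_b, total_articles):
            """Return a relatedness score in [0, 1] from sets of incoming-link ids."""
            a, b = set(inlinks_a), set(inlinks_b)
            common = a & b
            if not common:
                return 0.0
            distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / \
                       (math.log(total_articles) - math.log(min(len(a), len(b))))
            return max(0.0, 1.0 - distance)

        # Hypothetical inlink sets for three articles in a 1,000,000-article wiki.
        cat   = {1, 2, 3, 4, 5, 6}
        dog   = {2, 3, 4, 5, 7, 8}
        quark = {9, 10}
        print(link_relatedness(cat, dog, 1_000_000))    # high: many shared inlinks
        print(link_relatedness(cat, quark, 1_000_000))  # 0.0: no shared inlinks
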
  • Can we avoid high coupling?

    Taube-Schock, Craig; Walker, Robert J.; Witten, Ian H. (2011)

    Conference item
    University of Waikato

    It is considered good software design practice to organize source code into modules and to favour within-module connections (cohesion) over between-module connections (coupling), leading to the oft-repeated maxim "low coupling/high cohesion". Prior research into network theory and its application to software systems has found evidence that many important properties in real software systems, including coupling, exhibit approximately scale-free structure; researchers have claimed that such scale-free structures are ubiquitous. This implies that high coupling must be unavoidable, statistically speaking, apparently contradicting standard ideas about software structure. We present a model that predicts that approximately scale-free structures ought to arise both for between-module connectivity and for overall connectivity, and not as the result of poor design or optimization shortcuts. These predictions are borne out by our large-scale empirical study. Hence we conclude that high coupling is not avoidable, and that this is in fact quite reasonable. (An illustrative sketch of how heavy-tailed connectivity can arise from growth appears after this record.)

    View record details
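
    Because the claim is that heavy-tailed (approximately scale-free) connectivity arises from how systems grow rather than from poor design, here is a minimal sketch of a generic preferential-attachment process; this is a textbook stand-in chosen for illustration, not the authors' model, and all parameters are invented.

        # Generic preferential attachment: each new node links to existing nodes
        # with probability proportional to their current degree. The growth process
        # alone yields a heavy-tailed degree distribution (a few highly coupled hubs).
        import random
        from collections import Counter

        def grow(n_nodes, links_per_node=2, seed=0):
            random.seed(seed)
            targets = [0, 1]                     # endpoints, repeated in proportion to degree
            degree = Counter({0: 1, 1: 1})
            for new in range(2, n_nodes):
                snapshot = list(targets)         # attach only to nodes that already exist
                for _ in range(links_per_node):
                    old = random.choice(snapshot)
                    degree[new] += 1
                    degree[old] += 1
                    targets.extend([new, old])
            return degree

        degree = grow(5000)
        distribution = Counter(degree.values())  # degree -> number of nodes with that degree
        for k in sorted(distribution)[:5]:
            print(f"degree {k}: {distribution[k]} nodes")   # low degrees are common
        print("maximum degree:", max(degree.values()))       # a few hubs are very large
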
  • A Fedora librarian interface

    Bainbridge, David; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    The Fedora content management system embodies a powerful and flexible digital object model. This paper describes a new open-source software front-end that enables end-user librarians to transfer documents and metadata in a variety of formats into a Fedora repository. The main graphical facility that Fedora itself provides for this task operates on one document at a time and is not librarian-friendly. A batch-driven alternative is possible, but requires documents to be converted beforehand into the XML format used by the repository, which demands programming skills. In contrast, our new scheme allows arbitrary collections of documents residing on the user's computer (or the web at large) to be ingested into a Fedora repository in one operation, without any need for programming expertise. Provision is also made for editing existing documents and metadata, and adding new ones. The documents can be in a wide variety of different formats, and the user interface is suitable for practicing librarians. The design capitalizes on our experience in building the Greenstone librarian interface and participating in dozens of workshops with librarians worldwide.

    View record details
  • Browsing around a Digital Library: Today and Tomorrow

    Witten, Ian H. (2000)

    Conference item
    University of Waikato

    What will it be like to work in tomorrow’s digital library? We begin by browsing around an experimental digital library of the present, glancing at some collections and showing how they are organized. Then we look to the future. Although present digital libraries are quite like conventional libraries, we argue that future ones will feel qualitatively different. Readers (and writers) will work in the library using a kind of context-directed browsing. This will be supported by structures derived from automatic analysis of the contents of the library (not just the catalogue or abstracts, but the full text of the books and journals) using new techniques of text mining.

    View record details
  • Greenstone: A platform for distributed digital library applications

    Bainbridge, David; Buchanan, George; McPherson, John R.; Jones, Steve; Mahoui, Abdelaziz; Witten, Ian H. (2001)

    Conference item
    University of Waikato

    This paper examines the issues surrounding distributed Digital Library protocols. First, it reviews three prominent digital library protocols: Z39.50, SDLIP, and Dienst, plus Greenstone’s own protocol. Then, we summarise the implementation in the Greenstone Digital Library of a number of different protocols for distributed digital libraries, and describe sample applications of the same: a digital library for children, a translator for Stanford’s Simple Digital Library Interoperability Protocol, a Z39.50 client, and a bibliographic search tool. The paper concludes with a comparison of all four protocols, and a brief discussion of the impact of distributed protocols on the Greenstone system.

    View record details
  • Learning language using genetic algorithms

    Smith, Tony C.; Witten, Ian H. (1996)

    Conference item
    University of Waikato

    Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, so that ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong. This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language. (A generic GA skeleton appears after this record.)

    View record details
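
    The following is a generic GA skeleton in the spirit of the abstract above: a population of toy "grammars" (here just sets of allowed word bigrams) is evolved under a fitness that rewards covering the positive sentences while penalising grammar size. Everything in the sketch (sentences, representation, operators, parameters) is invented for illustration; the paper's representation of categories and rules is considerably richer.

        # Toy genetic algorithm: evolve sets of allowed word bigrams towards a
        # "grammar" that covers the positive sentences with as few bigrams as possible.
        import random

        SENTENCES = ["the dog runs", "the cat runs", "a dog sleeps", "a cat sleeps"]

        def bigrams(sentence):
            words = sentence.split()
            return set(zip(words, words[1:]))

        ALL_BIGRAMS = sorted(set().union(*(bigrams(s) for s in SENTENCES)))

        def fitness(grammar):
            covered = sum(bigrams(s) <= grammar for s in SENTENCES)
            return covered - 0.1 * len(grammar)              # coverage minus size penalty

        def mutate(grammar):
            g = set(grammar)
            g.symmetric_difference_update({random.choice(ALL_BIGRAMS)})  # toggle one bigram
            return frozenset(g)

        def crossover(a, b):
            return frozenset(x for x in a | b if random.random() < 0.5)

        random.seed(1)
        population = [frozenset(random.sample(ALL_BIGRAMS, 3)) for _ in range(20)]
        for _ in range(100):                                 # generations
            population.sort(key=fitness, reverse=True)
            parents = population[:10]                        # keep the fittest half
            children = [mutate(crossover(*random.sample(parents, 2))) for _ in range(10)]
            population = parents + children

        best = max(population, key=fitness)
        print(f"best fitness {fitness(best):.2f}, grammar size {len(best)}")
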
  • Bi-level document image compression using layout information

    Inglis, Stuart J.; Witten, Ian H. (1996)

    Conference item
    University of Waikato

    Most bi-level images stored on computers today comprise scanned text, and are stored using generic bi-level image technology based either on classical run-length coding, such as the CCITT Group 4 method, or on modern schemes such as JBIG that predict pixels from their local image context. However, image compression methods that are tailored specifically for images known to contain printed text can provide noticeably superior performance because they effectively enlarge the context to the character level, at least for those predictions for which such a context is relevant. To deal effectively with general documents that contain text and pictures, it is necessary to detect layout and structural information from the image, and employ different compression techniques for different parts of the image. We extend previous work in document image compression in two ways: first, we include automatic discrimination between text and non-text zones in an image; second, we test the system on a large real-world image corpus. (An illustrative sketch of context-based pixel prediction appears after this record.)

    View record details
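
    To illustrate the kind of local-context pixel prediction that generic bi-level schemes such as JBIG rely on (and that text-specific methods extend to the character level), here is a minimal sketch of an adaptive context model; the template, the tiny image, and the counts are invented for the example.

        # Adaptive context modelling for a bi-level image: predict each pixel from a
        # small causal neighbourhood of already-coded pixels and accumulate the ideal
        # code length (-log2 p). Template and image are invented for illustration.
        import math
        from collections import defaultdict

        IMAGE = [                                  # a tiny bi-level "scan" (1 = black)
            [0, 0, 1, 1, 0, 0, 1, 1],
            [0, 0, 1, 1, 0, 0, 1, 1],
            [1, 1, 0, 0, 1, 1, 0, 0],
            [1, 1, 0, 0, 1, 1, 0, 0],
        ]
        TEMPLATE = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]   # causal neighbour offsets

        def pixel(img, r, c):
            return img[r][c] if 0 <= r < len(img) and 0 <= c < len(img[0]) else 0

        counts = defaultdict(lambda: [1, 1])       # context -> [count of 0s, count of 1s]
        bits = 0.0
        for r in range(len(IMAGE)):
            for c in range(len(IMAGE[0])):
                ctx = tuple(pixel(IMAGE, r + dr, c + dc) for dr, dc in TEMPLATE)
                x = IMAGE[r][c]
                p = counts[ctx][x] / sum(counts[ctx])      # adaptive probability estimate
                bits += -math.log2(p)
                counts[ctx][x] += 1                        # update the model after coding

        print(f"{bits:.1f} model bits for {len(IMAGE) * len(IMAGE[0])} raw pixels")
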
  • Second language learning in the context of MOOCs

    Wu, Shaoqun; Fitzgerald, Alannah; Witten, Ian H. (2014)

    Conference item
    University of Waikato

    Massive Open Online Courses are becoming popular educational vehicles through which universities reach out to non-traditional audiences. Many enrolees hail from other countries and cultures, and struggle to cope with the English language in which these courses are invariably offered. Moreover, most such learners have a strong desire and motivation to extend their knowledge of academic English, particularly in the specific area addressed by the course. Online courses provide a compelling opportunity for domain-specific language learning. They supply a large corpus of interesting linguistic material relevant to a particular area, including supplementary images (slides), audio and video. We contend that this corpus can be automatically analysed, enriched, and transformed into a resource that learners can browse and query in order to extend their ability to understand the language used, and help them express themselves more fluently and eloquently in that domain. To illustrate this idea, an existing online corpus-based language learning tool (FLAX) is applied to a Coursera MOOC entitled Virology 1: How Viruses Work, offered by Columbia University.

    View record details
  • Constructing a focused taxonomy from a document collection

    Medelyan, Olena; Manion, Steve; Broekstra, Jeen; Divoli, Anna; Huang, Anna-Lan; Witten, Ian H. (2013)

    Conference item
    University of Waikato

    We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.

    View record details
  • Perambulating libraries: Demonstrating how a Victorian idea can help OLPC users share books

    Witten, Ian H.; Bainbridge, David (2011)

    Conference item
    University of Waikato

    In this extended abstract we detail how the open source digital library toolkit Greenstone [5] can help users of the XO-laptop (produced by the One Laptop Per Child Foundation) manage and share electronic documents. The idea draws its inspiration from the mobile libraries (bookmobiles) that first appeared in Victorian times. The implemented technique works by building on the Mesh network that is instrumental to the XO-laptop approach. To use the technique, a version of Greenstone is installed on each portable XO-laptop, allowing the owner to develop and manage their own set of books. This version of Greenstone has been adapted to support a form of interoperability we have called Digital Library Talkback. On the Mesh, when two XO-laptops “see” each other, the two users can search and browse each other’s digital libraries; when a user sees a book they like, they can have it transferred to their own library with a single click using the Digital Library Talkback mechanism.

    View record details
  • Running Greenstone on an iPod

    Bainbridge, David; Jones, Steve; McIntosh, Samuel John; Jones, Matt; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    The open source digital library software Greenstone is demonstrated running on an iPod. The standalone configuration supports browsing, searching and displaying documents in a range of media formats. Plugged into a host computer (Mac, Linux, or Windows), exactly the same facilities are made available to the world through a built-in web server.

    View record details
  • Detecting replay attacks in audiovisual identity verification

    Bredin, Herve; Miguel, Antonio; Witten, Ian H.; Chollet, Gerard (2006)

    Conference item
    University of Waikato

    We describe an algorithm that detects a lack of correspondence between speech and lip motion by monitoring the degree of synchrony between live audio and visual signals. It is simple, effective, and computationally inexpensive, providing a useful degree of robustness against basic replay attacks and against speech or image forgeries. The method is based on a cross-correlation analysis between two streams of features, one from the audio signal and the other from the image sequence. We argue that such an algorithm forms an effective first barrier against several kinds of replay attack that would defeat existing verification systems based on standard multimodal fusion techniques. In order to provide an evaluation mechanism for the new technique we have augmented the protocols that accompany the BANCA multimedia corpus by defining new scenarios. We obtain 0% equal-error rate (EER) on the simplest scenario and 35% on a more challenging one. (A minimal cross-correlation sketch appears after this record.)

    View record details
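
    The core idea above, scoring synchrony between an audio feature stream and a lip-motion feature stream by cross-correlation, can be sketched as follows. The one-dimensional signals and the lag range are invented placeholders, not the features or parameters used in the paper.

        # Synchrony via normalised cross-correlation over a small range of lags:
        # a live recording should correlate strongly near lag 0, while a replayed or
        # substituted signal should not. All signals below are invented placeholders.
        import numpy as np

        def max_synchrony(audio_feat, visual_feat, max_lag=5):
            a = (audio_feat - audio_feat.mean()) / (audio_feat.std() + 1e-9)
            v = (visual_feat - visual_feat.mean()) / (visual_feat.std() + 1e-9)
            best = -1.0
            for lag in range(-max_lag, max_lag + 1):
                if lag >= 0:
                    x, y = a[lag:], v[:len(v) - lag]
                else:
                    x, y = a[:lag], v[-lag:]
                if len(x):
                    best = max(best, float(np.mean(x * y)))   # correlation at this lag
            return best                                        # near 1.0 = well synchronised

        t = np.linspace(0, 1, 200)
        mouth = np.sin(2 * np.pi * 4 * t)                          # lip-opening proxy
        live_audio = mouth + 0.1 * np.random.default_rng(0).normal(size=200)
        replay_audio = np.roll(mouth, 25)                          # about half a cycle out of sync
        print("live  :", round(max_synchrony(live_audio, mouth), 2))
        print("replay:", round(max_synchrony(replay_audio, mouth), 2))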