77 results for Witten, Ian H., Conference item

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

  • Exploring Wikipedia with Hōpara

    Milne, David N.; Witten, Ian H. (2011)

    Conference item
    University of Waikato

    Anyone who has browsed Wikipedia has likely experienced the feeling of being happily lost, browsing from one interesting topic to the next and encountering information that they would never have searched for explicitly. With some 3M articles and 70M links, Wikipedia represents an extreme example of large-scale hypertext. We consider it to be a rich and challenging platform for investigating navigation and disorientation in large interconnected information spaces. This demonstration showcases Hōpara, a new search engine that aims to make Wikipedia and its underlying link structure easier to explore. It works on top of the encyclopedia’s existing link structure, abstracting away from document content and allowing users to navigate the resource at a higher level. It utilizes semantic relatedness measures to emphasize articles and connections that are most likely to be of interest, visualization to expose the structure of how the available information is organized, and lightweight information extraction to explain itself.

  • Clustering documents with active learning using Wikipedia

    Huang, Anna; Witten, Ian H.; Frank, Eibe; Milne, David N. (2009)

    Conference item
    University of Waikato

    Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts, and that adding constraints improves clustering performance further by up to 20%.

  • Perambulating libraries: Demonstrating how a Victorian idea can help OLPC users share books

    Witten, Ian H.; Bainbridge, David (2011)

    Conference item
    University of Waikato

    In this extended abstract we detail how the open source digital library toolkit Greenstone [5] can help users of the XO-laptop (produced by the One Laptop Per Child Foundation) manage and share electronic documents. The idea draws upon mobile libraries (bookmobiles) for its inspiration, which first appeared in Victorian times. The implemented technique works by building on the Mesh network that is instrumental to the XO-laptop approach. To use the technique, on each portable XO-laptop a version of Greenstone is installed, allowing the owner to develop and manage their own set of books. The version of Greenstone has been adapted to support a form of interoperability we have called Digital Library Talkback. On the Mesh, when two XO-laptops “see” each other, the two users can search and browse the other user’s digital library; when they see a book they like, they can have it transferred to their library with a single click using the Digital Library Talkback mechanism.

  • Greenstone: A comprehensive open-source digital library software system

    Witten, Ian H.; McNab, Rodger J.; Boddie, Stefan J.; Bainbridge, David (2000)

    Conference item
    University of Waikato

    This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software "plugins" accommodate different document and metadata types.

  • Greenstone digital library software: current research

    Bainbridge, David; Witten, Ian H. (2004)

    Conference item
    University of Waikato

    The Greenstone digital library software (www.greenstone.org) provides a flexible way of organizing information and publishing it on the Internet or removable media such as CD-ROM. Its aim is to empower users, particularly in universities, libraries and other public service institutions, to build their own digital libraries. It is open-source software, issued under the terms of the GNU General Public License. It is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.

  • Using a permutation test for attribute selection in decision trees

    Frank, Eibe; Witten, Ian H. (1998)

    Conference item
    University of Waikato

    Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case, and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favourably with other methods that also adopt a pre-pruning strategy.
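
    The general idea of a permutation test for attribute-class association can be sketched as follows. This is a generic Monte Carlo permutation test, not the two-stage method the paper proposes; the function names, the choice of the Pearson chi-squared value as the test statistic, and the parameters `n_permutations` and `seed` are all assumptions made for illustration.

```python
import random
from collections import Counter


def chi_squared_statistic(attribute, labels):
    """Pearson chi-squared statistic of the attribute/class contingency table."""
    n = len(labels)
    attr_counts = Counter(attribute)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(attribute, labels))
    stat = 0.0
    for a, a_count in attr_counts.items():
        for c, c_count in label_counts.items():
            expected = a_count * c_count / n
            observed = joint_counts.get((a, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(attribute, labels, n_permutations=1000, seed=0):
    """Estimate a p-value by shuffling class labels (hypothetical helper).

    The statistic is compared against its own distribution under random
    relabelling, so no large-sample chi-squared approximation is needed.
    """
    rng = random.Random(seed)
    observed = chi_squared_statistic(attribute, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point round-off in ties.
        if chi_squared_statistic(attribute, shuffled) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Tiny illustrative sample, as might be found near a leaf of a tree.
labels = ["y", "y", "y", "y", "n", "n", "n", "n"]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b"]
p_informative = permutation_p_value(informative, labels)
p_uninformative = permutation_p_value(uninformative, labels)
```

    In a pre-pruning setting, one would typically split on an attribute only if its estimated p-value falls below the chosen significance level.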

  • Semantic document representation: Do It with Wikification

    Witten, Ian H. (2012)

    Conference item
    University of Waikato

    Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment. Wikification is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to prime machine-learned algorithms for disambiguation and selection respectively. Wikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. I will show how it can be applied to document clustering and classification algorithms, and to produce back-of-the-book indexes, improving on the state of the art in each case.

  • Learning English with FLAX apps

    Yu, Alex; Witten, Ian H. (2015)

    Conference item
    University of Waikato

    The rise of Mobile Assisted Language Learning has brought a new dimension and dynamic into language classes. Game-like language learning apps have become a particularly effective way to promote self-learning outside the classroom to young learners. This paper describes a system called FLAX that allows teachers to use their own material to build digital library collections that can then be used to create a variety of web- and mobile-based language games like Hangman, Scrambled Sentences, Split Sentences, Word Guessing, and Punctuation and Capitalization. These games can be easily downloaded on Android handheld systems such as phones and tablets, and are automatically updated whenever new materials are added by teachers through a web-based interface on the FLAX server.

  • Detecting replay attacks in audiovisual identity verification

    Bredin, Herve; Miguel, Antonio; Witten, Ian H.; Chollet, Gerard (2006)

    Conference item
    University of Waikato

    We describe an algorithm that detects a lack of correspondence between speech and lip motion by detecting and monitoring the degree of synchrony between live audio and visual signals. It is simple, effective, and computationally inexpensive, providing a useful degree of robustness against basic replay attacks and against speech or image forgeries. The method is based on a cross-correlation analysis between two streams of features, one from the audio signal and the other from the image sequence. We argue that such an algorithm forms an effective first barrier against several kinds of replay attack that would defeat existing verification systems based on standard multimodal fusion techniques. In order to provide an evaluation mechanism for the new technique we have augmented the protocols that accompany the BANCA multimedia corpus by defining new scenarios. We obtain 0% equal-error rate (EER) on the simplest scenario and 35% on a more challenging one.
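
    A minimal sketch of cross-correlating two feature streams is given below. The abstract does not say which features are used, so the one-dimensional per-frame sequences here (for example, audio energy and mouth opening) are assumptions, as are the function names and the `max_lag` parameter; this is not the authors' implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def synchrony_score(audio, visual, max_lag=2):
    """Highest correlation between the two streams over lags in [-max_lag, max_lag].

    A positive lag drops the first `lag` audio frames; a negative lag drops
    the first `-lag` visual frames. The overlap is then truncated to a
    common length before correlating.
    """
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio[lag:], visual[:len(visual) - lag]
        else:
            a, v = audio[:lag], visual[-lag:]
        m = min(len(a), len(v))
        if m < 2:
            continue  # too little overlap to correlate
        best = max(best, pearson(a[:m], v[:m]))
    return best


# Hypothetical per-frame features: the visual stream trails the audio by one frame.
audio = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]
visual_live = [0.0] + audio[:-1]
visual_still = [0.5] * len(audio)  # e.g. a still photograph: no lip motion

live_score = synchrony_score(audio, visual_live)
still_score = synchrony_score(audio, visual_still)
```

    A verification system built along these lines would reject an attempt whose score falls below a threshold tuned on held-out data; the threshold is not something the abstract specifies.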

  • Browsing around a Digital Library: Today and Tomorrow

    Witten, Ian H. (2000)

    Conference item
    University of Waikato

    What will it be like to work in tomorrow’s digital library? We begin by browsing around an experimental digital library of the present, glancing at some collections and showing how they are organized. Then we look to the future. Although present digital libraries are quite like conventional libraries, we argue that future ones will feel qualitatively different. Readers (and writers) will work in the library using a kind of context-directed browsing. This will be supported by structures derived from automatic analysis of the contents of the library (not just the catalogue, or abstracts, but the full text of the books and journals) using new techniques of text mining.

  • Greenstone: A platform for distributed digital library applications

    Bainbridge, David; Buchanan, George; McPherson, John R.; Jones, Steve; Mahoui, Abdelaziz; Witten, Ian H. (2001)

    Conference item
    University of Waikato

    This paper examines the issues surrounding distributed Digital Library protocols. First, it reviews three prominent digital library protocols: Z39.50, SDLIP, and Dienst, plus Greenstone’s own protocol. Then, we summarise the implementation in the Greenstone Digital Library of a number of different protocols for distributed digital libraries, and describe sample applications of the same: a digital library for children, a translator for Stanford’s Simple Digital Library Interoperability Protocol, a Z39.50 client, and a bibliographic search tool. The paper concludes with a comparison of all four protocols, and a brief discussion of the impact of distributed protocols on the Greenstone system.

  • Practical digital library interoperability standards

    Bainbridge, David; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    As the field of digital libraries matures and new systems and standards develop, the ability to interoperate between systems becomes paramount. This tutorial gives a practical introduction to many recent standards and de facto standards for interoperability, and illustrates them using open source digital library software, including online demonstrations of interoperation issues and solutions. Core standards that are discussed include Dublin Core, OAI-PMH, METS, and MODS. We use interoperation between Greenstone and DSpace as a motivating case study. For those demonstrations that involve Greenstone, attendees who wish to may bring their laptops, install Greenstone from a CD-ROM that we will provide, along with various sample files, and follow along with the demonstrations on their own machine.

  • Seeking information in realistic books: a user study

    Liesaputra, Veronica; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    There are opposing views on whether readers gain any advantage from using a computer model of a 3D physical book. There is enough evidence, both anecdotal and from formal user studies, to suggest that the usual HTML or PDF presentation of documents is not always the most convenient, or the most comfortable, for the reader. On the other hand it is quite clear that while 3D book models have been prototyped and demonstrated, none are in routine use in today's digital libraries. And how do 3D book models compare with actual books? This paper reports on a user study designed to compare the performance of a practical Realistic Book implementation with conventional formats (HTML and PDF) and with physical books. It also evaluates the annotation features that the implementation provides.

  • A link-based visual search engine for Wikipedia

    Milne, David N.; Witten, Ian H. (2011)

    Conference item
    University of Waikato

    This paper introduces Hōpara, a new search engine that aims to make Wikipedia easier to explore. It works on top of the encyclopedia's existing link structure, abstracting away from document content and allowing users to navigate the resource at a higher level. It utilizes semantic relatedness measures to emphasize articles and connections that are most likely to be of interest, visualization to expose the structure of how the available information is organized, and lightweight information extraction to explain itself.
