75 results for Witten, Ian H., Conference item

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.
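
The idea of hierarchically structuring repeated subsequences can be illustrated with a short sketch. This is not SEQUITUR itself, which works incrementally and combines several further inference steps. It is a simpler offline scheme, in the style of Re-Pair, that repeatedly replaces the most frequent adjacent pair of symbols with a new rule. The function names `build_grammar` and `expand` and the rule names `R0`, `R1`, ... are hypothetical, and the input is assumed to be a string of single-character symbols so that rule names cannot collide with them.

```python
from collections import Counter


def build_grammar(sequence):
    """Offline digram replacement (illustrative only, not SEQUITUR).

    Returns the compressed top-level sequence and a dict that maps each
    rule name to the pair of symbols it stands for.
    """
    seq = list(sequence)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so nothing more to factor out
        name = f"R{next_id}"  # assumed not to clash with input symbols
        next_id += 1
        rules[name] = pair
        # Rewrite the sequence left to right, replacing non-overlapping
        # occurrences of the chosen pair with the new rule name.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the original characters."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol
```

Calling `build_grammar("abcabcabc")` yields a short top-level sequence whose rules nest inside one another, and joining `expand(s, rules)` over that sequence reproduces the input.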

  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

  • Using a permutation test for attribute selection in decision trees

    Frank, Eibe; Witten, Ian H. (1998)

    Conference item
    University of Waikato

    Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case, and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favourably with other methods that also adopt a pre-pruning strategy.
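
The general idea of a permutation test for attribute/class association can be sketched as follows. This is a minimal illustration, not the two-stage method the paper proposes, and the abstract does not say which test statistic the authors chose. The sketch assumes the Pearson chi-squared statistic, estimates the p-value by Monte Carlo shuffling of the class labels, and uses the hypothetical names `chi_squared` and `permutation_p_value`.

```python
import random
from collections import Counter


def chi_squared(attr_values, class_labels):
    """Pearson chi-squared statistic of the attribute/class contingency table."""
    n = len(attr_values)
    attr_counts = Counter(attr_values)
    class_counts = Counter(class_labels)
    joint = Counter(zip(attr_values, class_labels))
    stat = 0.0
    for a, n_a in attr_counts.items():
        for c, n_c in class_counts.items():
            expected = n_a * n_c / n
            observed = joint.get((a, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(attr_values, class_labels, trials=1000, seed=0):
    """Estimate how often shuffled labels give a statistic at least as
    large as the observed one (Monte Carlo permutation test)."""
    rng = random.Random(seed)
    observed = chi_squared(attr_values, class_labels)
    labels = list(class_labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)
        # Small tolerance guards against floating-point summation order.
        if chi_squared(attr_values, labels) >= observed - 1e-9:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

Because the reference distribution is built from the data at hand rather than from a large-sample approximation, the same procedure remains meaningful on the small subsets found near the leaves of a tree. An attribute would be considered for a split only if its estimated p-value falls below the chosen significance level.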

  • Creating and customizing digital library collections with the Greenstone Librarian Interface

    Witten, Ian H. (2004)

    Conference item
    University of Waikato

    The Greenstone digital library software is a comprehensive system for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet. This paper describes how digital library collections can be created and customized with the new Greenstone Librarian Interface. Its basic features allow users to add documents and metadata to collections, create new collections whose structure mirrors existing ones, and build collections and put them in place for users to view. More advanced users can design and customize new collection structures. At the most advanced level, the Librarian Interface gives expert users interactive access to the full power of Greenstone, which could formerly be tapped only by running Perl scripts manually.

  • Thesaurus-based index term extraction for agricultural documents

    Medelyan, Olena; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.

  • Semantic document representation: Do It with Wikification

    Witten, Ian H. (2012)

    Conference item
    University of Waikato

    Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment. Wikification is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to prime machine-learned algorithms for disambiguation and selection respectively. Wikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. I will show how it can be applied to document clustering and classification algorithms, and to produce back-of-the-book indexes, improving on the state of the art in each case.

  • Digital library access for illiterate users

    Deo, Shaleen; Nichols, David M.; Cunningham, Sally Jo; Witten, Ian H.; Trujillo, Maria F. (2004)

    Conference item
    University of Waikato

    The problems that illiteracy poses in accessing information are gaining attention from the research community. Issues currently being explored include developing an understanding of the barriers to information acquisition experienced by different groups of illiterate information seekers; creating technology, such as software interfaces, that supports illiterate users effectively; and tailoring content to increase its accessibility. We have taken a formative evaluation approach to developing and evaluating a digital library interface for illiterate users. We discuss modifications to the Greenstone platform, describe user studies and outline resulting design implications.

  • Using Wikipedia for language learning

    Wu, Shaoqun; Witten, Ian H. (2015)

    Conference item
    University of Waikato

    Differentiating between words like look, see and watch, injury and wound, or broad and wide presents great challenges to language learners because it is the collocates of these words that reveal their different shades of meaning, rather than their dictionary definitions. This paper describes a system called FlaxCLS that overcomes the restrictions and limitations of the existing tools used for collocation learning. FlaxCLS automatically extracts useful syntactic-based word combinations from three million Wikipedia articles and provides a simple interface through which learners seek collocations of any words, or search for combinations of multiple words. The system also retrieves semantically related words and collocations of the query term by consulting Wikipedia. FlaxCLS has been used as language support for many Masters and PhD students in a New Zealand university. Anecdotal evidence suggests that the interface it provides is easy to use and students have found it helpful in improving their written English.

  • Learning English with FLAX apps

    Yu, Alex; Witten, Ian H. (2015)

    Conference item
    University of Waikato

    The rise of Mobile Assisted Language Learning has brought a new dimension and dynamic into language classes. Game-like language learning apps have become a particularly effective way to promote self-learning outside the classroom among young learners. This paper describes a system called FLAX that allows teachers to use their own material to build digital library collections that can then be used to create a variety of web- and mobile-based language games such as Hangman, Scrambled Sentences, Split Sentences, Word Guessing, and Punctuation and Capitalization. These games can be easily downloaded on Android handheld systems such as phones and tablets, and are automatically updated whenever new materials are added by teachers through a web-based interface on the FLAX server.

  • A mobile reader for language learners

    König, Jemma; Witten, Ian H.; Wu, Shaoqun (2016)

    Conference item
    University of Waikato

    This paper describes a new approach to mobile language learning: a mobile reader that aids learners in extending the breadth of their existing vocabulary knowledge. The FLAX Reader supports L2 (second language) learners in English by building a personalized learner model of receptive vocabulary acquisition. It provides dictionary lookup for words that they struggle with, tracks a learner’s reading speed, and models their vocabulary acquisition, generating appropriate exercises to aid in a learner’s personal language growth.

  • Learning collocations with FLAX apps

    Yu, Alex; Wu, Shaoqun; Witten, Ian H.; König, Jemma (2016)

    Conference item
    University of Waikato

    The rise of Mobile Assisted Language Learning has brought a new dimension and dynamic into language classes. Game-like apps have become a particularly effective way to promote self-learning among young learners outside the classroom. This paper describes a system called FLAX that allows teachers to automatically generate a variety of collocation games from a contemporary collocation database built from Wikipedia text. These games are fun to play and mimic traditional classroom activities such as Collocation Matching, Collocation Guessing, Collocation Dominoes, and Related Words. The apps can be downloaded onto Android devices from the Google Play store, and exercises are automatically updated whenever new materials are added by teachers through a web-based interface on the FLAX server. Teachers have used these games to provide supplementary material for several Massive Open Online Courses (MOOCs) in the discipline of law.

  • Experiences with the Greenstone digital library software for international development

    Nichols, David M.; Rose, John; Bainbridge, David; Witten, Ian H. (2010)

    Conference item
    University of Waikato

    Greenstone is a versatile open source multilingual digital library environment, emerging from research on text compression within the New Zealand Digital Library Research Project in the Department of Computer Science at the University of Waikato. In 1997 we began to work with Human Info NGO to help them produce fully-searchable CD-ROM collections of humanitarian information. The software has since evolved to support a variety of application contexts. Rather than treating it simply as a delivery mechanism, we have emphasised the empowerment of users to create and distribute their own digital collections.

  • Modeling for optimal probability prediction

    Wang, Yong; Witten, Ian H. (2002)

    Conference item
    University of Waikato

    We present a general modelling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural by-product. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite. As a case study, we investigate the problem of fitting logistic models in finite-sample situations. Simulation results on both artificial and practical datasets are supportive.

  • Learning language using genetic algorithms

    Smith, Tony C.; Witten, Ian H. (1996)

    Conference item
    University of Waikato

    Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong. This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language.

  • Greenstone: A comprehensive open-source digital library software system

    Witten, Ian H.; McNab, Rodger J.; Boddie, Stefan J.; Bainbridge, David (2000)

    Conference item
    University of Waikato

    This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software "plugins" accommodate different document and metadata types.

  • Determining progression in glaucoma using visual fields

    Turpin, Andrew; Frank, Eibe; Hall, Mark A.; Witten, Ian H.; Johnson, Chris A. (2001)

    Conference item
    University of Waikato

    The standardized visual field assessment, which measures visual function in 76 locations of the central visual area, is an important diagnostic tool in the treatment of the eye disease glaucoma. It helps determine whether the disease is stable or progressing towards blindness, with important implications for treatment. Automatic techniques to classify patients based on this assessment have had limited success, primarily due to the high variability of individual visual field measurements. The purpose of this paper is to describe the problem of visual field classification to the data mining community, and assess the success of data mining techniques on it. Preliminary results show that machine learning methods rival existing techniques for predicting whether glaucoma is progressing—though we have not yet been able to demonstrate improvements that are statistically significant. It is likely that further improvement is possible, and we encourage others to work on this important practical data mining problem.
