76 results for Witten, Ian H., Conference item

  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

    View record details
  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

    View record details
  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

    View record details
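
    The record above describes SEQUITUR only at a high level. As a rough illustration of how a hierarchy of rules can emerge from repeated subsequences, the following Python sketch repeatedly replaces the most frequent digram with a new rule. It is an offline simplification (closer in spirit to batch digram replacement, as in Re-Pair, than to SEQUITUR's incremental algorithm), and every name in it is illustrative.

      # Simplified, offline sketch of hierarchical phrase inference by repeated
      # digram replacement; the real SEQUITUR algorithm works incrementally.
      from collections import Counter

      def infer_grammar(sequence):
          rules = {}                     # nonterminal -> the digram it expands to
          seq = list(sequence)
          next_id = 0
          while True:
              digrams = Counter(zip(seq, seq[1:]))
              pair, count = digrams.most_common(1)[0] if digrams else ((None, None), 0)
              if count < 2:
                  break                  # no digram repeats: nothing left to factor out
              nonterminal = f"R{next_id}"
              next_id += 1
              rules[nonterminal] = pair
              out, i = [], 0             # rewrite the sequence using the new rule
              while i < len(seq):
                  if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                      out.append(nonterminal)
                      i += 2
                  else:
                      out.append(seq[i])
                      i += 1
              seq = out
          return seq, rules

      start, rules = infer_grammar("abcabcabcabc")
      print(start)   # ['R2', 'R2']
      print(rules)   # {'R0': ('a', 'b'), 'R1': ('R0', 'c'), 'R2': ('R1', 'R1')}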
  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

    View record details
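
    One idea underlying compressed full-text indexes of the kind described above is to store each term's posting list as gaps between ascending document numbers and to encode the gaps with a variable-length code. The sketch below uses Elias gamma codes; it is an illustrative fragment, not the system described in the record.

      # Minimal sketch: gamma-code the document gaps of a posting list.
      def gamma_encode(n: int) -> str:
          """Elias gamma code of a positive integer, as a bit string."""
          assert n >= 1
          binary = bin(n)[2:]                        # e.g. 9 -> '1001'
          return "0" * (len(binary) - 1) + binary    # unary length prefix + binary

      def encode_postings(doc_ids):
          """Ascending document numbers -> gamma-coded bit string of their gaps."""
          bits, previous = [], 0
          for d in doc_ids:
              bits.append(gamma_encode(d - previous))   # gaps are always >= 1
              previous = d
          return "".join(bits)

      postings = [3, 7, 8, 21, 150]
      encoded = encode_postings(postings)
      print(encoded, len(encoded), "bits vs", 32 * len(postings), "bits uncompressed")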
  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

    View record details
  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

    View record details
  • Domain-specific keyphrase extraction

    Frank, Eibe; Paynter, Gordon W.; Witten, Ian H.; Gutwin, Carl; Nevill-Manning, Craig G. (1999)

    Conference item
    University of Waikato

    Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.

    View record details
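
    The record above names the learning scheme but not the features. A rough, self-contained sketch of KEA-style naive Bayes keyphrase scoring follows; the two features (TF-IDF and relative position of first occurrence) follow the published approach, but the probability tables, thresholds, and document frequencies below are invented stand-ins for values that would be learned from training documents.

      import math, re

      # Hypothetical model parameters (in the real system, estimated from
      # author-assigned keyphrases in a training collection).
      MODEL = {
          "prior": 0.05,                        # P(candidate is a keyphrase)
          "tfidf_cut": 0.01, "first_cut": 0.2,  # discretisation thresholds
          "tfidf|yes": {"high": 0.7, "low": 0.3}, "tfidf|no": {"high": 0.2, "low": 0.8},
          "first|yes": {"early": 0.8, "late": 0.2}, "first|no": {"early": 0.3, "late": 0.7},
      }

      def score(tfidf, first_occurrence, model=MODEL):
          """Naive Bayes posterior that a candidate phrase is a keyphrase."""
          t = "high" if tfidf > model["tfidf_cut"] else "low"
          f = "early" if first_occurrence < model["first_cut"] else "late"
          p_yes = model["prior"] * model["tfidf|yes"][t] * model["first|yes"][f]
          p_no = (1 - model["prior"]) * model["tfidf|no"][t] * model["first|no"][f]
          return p_yes / (p_yes + p_no)

      def extract(text, doc_freq, num_docs, top=5):
          """Rank candidate phrases (single words here, for brevity) in one document."""
          words = re.findall(r"[a-z]+", text.lower())
          seen, scored = {}, []
          for pos, w in enumerate(words):
              seen.setdefault(w, pos)             # position of first occurrence
          for w, first in seen.items():
              tf = words.count(w) / len(words)
              idf = math.log(num_docs / (1 + doc_freq.get(w, 0)))
              scored.append((score(tf * idf, first / len(words)), w))
          return sorted(scored, reverse=True)[:top]

      doc_freq = {"the": 95, "digital": 40, "library": 30}   # invented counts
      print(extract("The digital library software builds digital library collections",
                    doc_freq, num_docs=100))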
  • Semantic document representation: Do It with Wikification

    Witten, Ian H. (2012)

    Conference item
    University of Waikato

    Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment. Wikification is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to prime machine-learned algorithms for disambiguation and selection respectively. Wikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. I will show how it can be applied to document clustering and classification algorithms, and to produce back of the book indexes, improving on the state of the art in each case.

    View record details
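
    As a minimal illustration of the pipeline the record above outlines (detect candidate phrases, disambiguate, select), the sketch below looks phrases up in an anchor dictionary and picks each anchor's most common sense. The anchor statistics and thresholds are invented; the real system learns its disambiguation and selection models from Wikipedia's own links.

      import re

      # anchor text -> {article it links to: link count}, normally mined from Wikipedia.
      ANCHORS = {
          "jaguar": {"Jaguar (animal)": 620, "Jaguar Cars": 310},
          "big cat": {"Big cat": 150},
      }
      # How often the phrase appears as a link rather than plain text ("keyphraseness").
      LINK_PROB = {"jaguar": 0.45, "big cat": 0.30}

      def wikify(text, min_link_prob=0.1, max_len=2):
          words = re.findall(r"[a-z]+", text.lower())
          annotations = []
          for i in range(len(words)):
              for n in range(max_len, 0, -1):
                  phrase = " ".join(words[i:i + n])
                  if LINK_PROB.get(phrase, 0) >= min_link_prob and phrase in ANCHORS:
                      senses = ANCHORS[phrase]
                      concept = max(senses, key=senses.get)   # most common sense
                      annotations.append((phrase, concept))
                      break
          return annotations

      print(wikify("The jaguar is the largest big cat in the Americas"))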
  • Digital library access for illiterate users

    Deo, Shaleen; Nichols, David M.; Cunningham, Sally Jo; Witten, Ian H.; Trujillo, Maria F. (2004)

    Conference item
    University of Waikato

    The problems that illiteracy poses in accessing information are gaining attention from the research community. Issues currently being explored include developing an understanding of the barriers to information acquisition experienced by different groups of illiterate information seekers; creating technology, such as software interfaces, that support illiterate users effectively; and tailoring content to increase its accessibility. We have taken a formative evaluation approach to developing and evaluating a digital library interface for illiterate users. We discuss modifications to the Greenstone platform, describe user studies and outline resulting design implications.

    View record details
  • Learning language using genetic algorithms

    Smith, Tony C.; Witten, Ian H. (1996)

    Conference item
    University of Waikato

    Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong. This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language.

    View record details
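
    The following toy sketch illustrates the GA loop described above on an invented miniature problem: evolving an assignment of words to categories so that a few category-level bigram patterns cover a tiny corpus. The paper's actual grammar representation, fitness function, and operators differ; only the selection, crossover, and mutation skeleton, and the flavour of learning categories and rules together, are intended to carry over.

      import random
      from collections import Counter

      CORPUS = ["the cat sees a dog", "a dog sees the cat", "the dog likes a cat"]
      BIGRAMS = [pair for s in CORPUS for pair in zip(s.split(), s.split()[1:])]
      VOCAB = sorted({w for s in CORPUS for w in s.split()})
      NUM_CATEGORIES, RULES_ALLOWED = 3, 4

      def fitness(genome):
          """Fraction of corpus bigrams covered by the commonest category bigrams."""
          category = dict(zip(VOCAB, genome))
          patterns = Counter((category[a], category[b]) for a, b in BIGRAMS)
          covered = sum(count for _, count in patterns.most_common(RULES_ALLOWED))
          return covered / len(BIGRAMS)

      def evolve(pop_size=40, generations=60, mutation_rate=0.1):
          population = [[random.randrange(NUM_CATEGORIES) for _ in VOCAB]
                        for _ in range(pop_size)]
          for _ in range(generations):
              def pick():                          # tournament selection
                  return max(random.sample(population, 3), key=fitness)
              offspring = []
              while len(offspring) < pop_size:
                  mother, father = pick(), pick()
                  cut = random.randrange(1, len(VOCAB))      # one-point crossover
                  child = mother[:cut] + father[cut:]
                  child = [random.randrange(NUM_CATEGORIES)  # point mutation
                           if random.random() < mutation_rate else gene
                           for gene in child]
                  offspring.append(child)
              population = offspring
          best = max(population, key=fitness)
          return dict(zip(VOCAB, best)), fitness(best)

      print(evolve())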
  • A fedora librarian interface

    Bainbridge, David; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    The Fedora content management system embodies a powerful and flexible digital object model. This paper describes a new open-source software front-end that enables end-user librarians to transfer documents and metadata in a variety of formats into a Fedora repository. The main graphical facility that Fedora itself provides for this task operates on one document at a time and is not librarian-friendly. A batch-driven alternative is possible, but requires documents to be converted beforehand into the XML format used by the repository, necessitating programming skills. In contrast, our new scheme allows arbitrary collections of documents residing on the user's computer (or the web at large) to be ingested into a Fedora repository in one operation, without a need for programming expertise. Provision is also made for editing existing documents and metadata, and adding new ones. The documents can be in a wide variety of different formats, and the user interface is suitable for practicing librarians. The design capitalizes on our experience in building the Greenstone librarian interface and participating in dozens of workshops with librarians worldwide.

    View record details
  • A user evaluation of hierarchical phrase browsing

    Edgar, Katrina D.; Nichols, David M.; Paynter, Gordon W.; Thomson, Kirsten; Witten, Ian H. (2003)

    Conference item
    University of Waikato

    Phrase browsing interfaces based on hierarchies of phrases extracted automatically from document collections offer a useful compromise between automatic full-text searching and manually-created subject indexes. The literature contains descriptions of such systems that many find compelling and persuasive. However, evaluation studies have either been anecdotal, or focused on objective measures of the quality of automatically-extracted index terms, or restricted to questions of computational efficiency and feasibility. This paper reports on an empirical, controlled user study that compares hierarchical phrase browsing with full-text searching over a range of information seeking tasks. Users found the results located via phrase browsing to be relevant and useful but preferred keyword searching for certain types of queries. Users' experiences were marred by interface details, including inconsistencies between the phrase browser and the surrounding digital library interface.

    View record details
  • Digital libraries: developing countries, universal access, and information for all

    Witten, Ian H. (2004)

    Conference item
    University of Waikato

    Digital libraries are large, organized collections of information objects. Well-designed digital library software has the potential to enable non-specialist people to conceive, assemble, build, and disseminate new information collections. This has great social import because, by democratizing information dissemination, it provides a counterbalance to disturbing commercialization initiatives in the information and entertainment industries. This talk reviews trends in today's information environment, introduces digital library technology, and explores applications of digital libraries—including their use for disseminating humanitarian information in developing countries. We illustrate how currently available technology empowers users to build and publish information collections. Making digital libraries open to all, as conventional public libraries are, presents interesting challenges of universal access.

    View record details
  • How the dragons work: searching in a web

    Witten, Ian H. (2006)

    Conference item
    University of Waikato

    Search engines -- "web dragons" -- are the portals through which we access society's treasure trove of information. They do not publish the algorithms they use to sort and filter information, yet how they work is one of the most important questions of our time. Google's PageRank is a way of measuring the prestige of each web page in terms of who links to it: it reflects the experience of a surfer condemned to click randomly around the web forever. The HITS technique distinguishes "hubs" that point to reputable sources from "authorities," the sources themselves. This helps differentiate communities on the web, which in turn can tease out alternative interpretations of ambiguous query terms. RankNet uses machine learning techniques to rank documents by predicting relevance judgments based on training data. This article explains in non-technical terms how the dragons work.

    View record details
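
    The record above sketches PageRank informally as the experience of a surfer clicking randomly forever. A compact power-iteration version of that random-surfer idea follows; the toy link graph and parameter values are invented for illustration.

      def pagerank(links, damping=0.85, iterations=50):
          """links: page -> list of pages it links to."""
          pages = list(links)
          rank = {p: 1.0 / len(pages) for p in pages}
          for _ in range(iterations):
              new = {p: (1 - damping) / len(pages) for p in pages}
              for page, outgoing in links.items():
                  if not outgoing:                        # dangling page: spread evenly
                      for q in pages:
                          new[q] += damping * rank[page] / len(pages)
                  else:
                      for q in outgoing:
                          new[q] += damping * rank[page] / len(outgoing)
              rank = new
          return rank

      graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
      print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))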
  • Thesaurus-based index term extraction for agricultural documents

    Medelyan, Olena; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.

    View record details
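
    A bare-bones sketch of the controlled-vocabulary matching step that this kind of indexing relies on is given below. The miniature thesaurus is invented, and the semantic matching over Agrovoc's broader, narrower, and related links that the actual algorithm performs is not shown.

      import re

      THESAURUS = {                      # surface term -> preferred descriptor
          "maize": "maize", "corn": "maize",
          "soil fertility": "soil fertility",
          "plant breeding": "plant breeding", "crop breeding": "plant breeding",
      }

      def index_terms(text, max_len=2):
          words = re.findall(r"[a-z]+", text.lower())
          found = set()
          for i in range(len(words)):
              for n in range(max_len, 0, -1):
                  phrase = " ".join(words[i:i + n])
                  if phrase in THESAURUS:
                      found.add(THESAURUS[phrase])   # map to preferred descriptor
                      break
          return sorted(found)

      print(index_terms("Corn yields depend on soil fertility and crop breeding programmes"))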
  • Using Wikipedia for language learning

    Wu, Shaoqun; Witten, Ian H. (2015)

    Conference item
    University of Waikato

    Differentiating between words like look, see and watch, injury and wound, or broad and wide presents great challenges to language learners because it is the collocates of these words that reveal their different shades of meaning, rather than their dictionary definitions. This paper describes a system called FlaxCLS that overcomes the restrictions and limitations of existing tools for collocation learning. FlaxCLS automatically extracts useful syntax-based word combinations from three million Wikipedia articles and provides a simple interface through which learners can seek collocations of any word, or search for combinations of multiple words. The system also retrieves semantically related words and collocations of the query term by consulting Wikipedia. FlaxCLS has been used as language support for many Master's and PhD students at a New Zealand university. Anecdotal evidence suggests that the interface it provides is easy to use and students have found it helpful in improving their written English.

    View record details
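
    As a rough illustration of collocation scoring, the sketch below counts co-occurring word pairs in a small window and ranks them by pointwise mutual information. FlaxCLS itself works from syntactic patterns extracted from Wikipedia text; this frequency-based stand-in, on an invented scrap of corpus, only shows why an attested pair such as "broad daylight" would outrank an unattested one.

      import math, re
      from collections import Counter

      def collocations(corpus, window=2, min_count=2):
          words = re.findall(r"[a-z]+", corpus.lower())
          unigrams, pairs = Counter(words), Counter()
          for i, w in enumerate(words):
              for j in range(i + 1, min(i + window + 1, len(words))):
                  pairs[(w, words[j])] += 1          # co-occurrence within the window
          total = len(words)
          scored = []
          for (a, b), n in pairs.items():
              if n >= min_count:
                  pmi = math.log((n / total) / ((unigrams[a] / total) * (unigrams[b] / total)))
                  scored.append((pmi, a, b))
          return sorted(scored, reverse=True)

      corpus = "in broad daylight . they met in broad daylight . the road is wide ."
      for pmi, a, b in collocations(corpus)[:3]:
          print(f"{a} {b}: {pmi:.2f}")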
  • Learning English with FLAX apps

    Yu, Alex; Witten, Ian H. (2015)

    Conference item
    University of Waikato

    The rise of Mobile Assisted Language Learning has brought a new dimension and dynamic into language classes. Game-like language learning apps have become a particularly effective way to promote self-directed learning outside the classroom for young learners. This paper describes a system called FLAX that allows teachers to use their own material to build digital library collections that can then be used to create a variety of web- and mobile-based language games such as Hangman, Scrambled Sentences, Split Sentences, Word Guessing, and Punctuation and Capitalization. These games can be easily downloaded to Android devices such as phones and tablets, and are automatically updated whenever new materials are added by teachers through a web-based interface on the FLAX server.

    View record details
  • A new framework for building digital library collections

    Buchanan, George; Bainbridge, David; Don, Katherine J.; Witten, Ian H. (2005)

    Conference item
    University of Waikato

    This paper introduces a new framework for building digital library collections and contrasts it with existing systems. It describes a significant new step in the development of a widely-used open-source digital library system, Greenstone, which has evolved over many years. It is supported by a fresh implementation, which forced us to rethink the entire design rather than making incremental improvements. The redesign capitalizes on the best ideas from the existing system, which have been refined and developed to open new avenues through which digital librarians can tailor their collections. We demonstrate its flexibility by showing how digital library collections can be extended and altered to satisfy new requirements.

    View record details
  • Managing change in a digital library system with many interface languages

    Bainbridge, David; Edgar, Katrina D.; Witten, Ian H.; McPherson, John R. (2003)

    Conference item
    University of Waikato

    Managing the organizational and software complexity of a comprehensive open source digital library system presents a significant challenge. The challenge becomes even more imposing when the interface is available in different languages, for enhancements to the software and changes to the interface must be faithfully reflected in each language version. This paper describes the solution adopted by Greenstone, a multilingual digital library system distributed by UNESCO in a trilingual European version (English, French, Spanish), complete with all documentation, and whose interface is available in many further languages. Greenstone incorporates a language translation facility which allows authorized people to update the interface in specified languages. A standard version control system is used to manage software change, and from this the system automatically determines which language fragments need updating and presents them to the human translator.

    View record details
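
    A tiny sketch of the staleness check alluded to above: a language fragment needs retranslating when its base-language string has changed more recently than the translation. The revision numbers stand in for whatever the version control system reports, and all identifiers are invented.

      def fragments_needing_update(base_revisions, translation_revisions):
          """Each argument maps a fragment key to the revision at which it last changed."""
          stale = []
          for key, base_rev in base_revisions.items():
              translated_rev = translation_revisions.get(key)
              if translated_rev is None or translated_rev < base_rev:
                  stale.append(key)          # missing or older than the base string
          return stale

      english = {"home.title": 210, "search.button": 195, "help.text": 412}
      french = {"home.title": 230, "search.button": 180}
      print(fragments_needing_update(english, french))   # ['search.button', 'help.text']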
  • Token identification using HMM and PPM models

    Wen, Yingying; Witten, Ian H.; Wang, Dianhui (2003)

    Conference item
    University of Waikato

    Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been successfully used in language processing tasks, including learning-based token identification. Most existing systems are domain- and language-dependent, which limits their retargetability and applicability. This paper investigates the effect of combining HMMs and PPM for token identification. We implement a system that bridges the two well-known methods through words that are new to the identification model. The system is fully domain- and language-independent: no code changes are necessary when it is applied to other domains or languages, and the only required input is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 69.02% for TCC and 76.59% for BIB. Although this performance is not as good as that obtained from a system with language-dependent components, the proposed system can handle a wide range of problems in a domain- and language-independent way. Identification of dates gives the best results, with 73% and 92% of tokens correctly identified for the two corpora respectively. The system also performs reasonably well on people's names, with 68% of tokens correct for TCC and 76% for BIB.

    View record details
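
    As an illustration of the HMM half of such a system, the sketch below runs Viterbi decoding over token types (DATE, NAME, OTHER). All probabilities and the crude emission model are invented; in the paper they are estimated from an annotated corpus, with a PPM character model handling words unseen in training.

      import math

      STATES = ["DATE", "NAME", "OTHER"]
      START = {"DATE": 0.1, "NAME": 0.2, "OTHER": 0.7}
      TRANS = {
          "DATE": {"DATE": 0.6, "NAME": 0.1, "OTHER": 0.3},
          "NAME": {"DATE": 0.1, "NAME": 0.5, "OTHER": 0.4},
          "OTHER": {"DATE": 0.1, "NAME": 0.1, "OTHER": 0.8},
      }

      def emission(state, token):
          """Crude stand-in for a learned emission model."""
          if state == "DATE":
              return 0.8 if token[0].isdigit() or token.lower() in {"january", "feb"} else 0.01
          if state == "NAME":
              return 0.6 if token.istitle() else 0.05
          return 0.3

      def viterbi(tokens):
          # best path ending in each state, with its log probability
          path = {s: ([s], math.log(START[s]) + math.log(emission(s, tokens[0])))
                  for s in STATES}
          for token in tokens[1:]:
              new = {}
              for s in STATES:
                  prev, score = max(((p, path[p][1] + math.log(TRANS[p][s]))
                                     for p in STATES), key=lambda x: x[1])
                  new[s] = (path[prev][0] + [s], score + math.log(emission(s, token)))
              path = new
          return max(path.values(), key=lambda x: x[1])[0]

      print(viterbi(["Meeting", "with", "Alice", "on", "12", "January"]))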