77 results for Witten, Ian H., Conference item

  • Learning structure from sequences, with applications in a digital library

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.

    View record details
  • Examples of practical digital libraries: collections built internationally using Greenstone

    Witten, Ian H. (2002)

    Conference item
    University of Waikato

    Although the field of digital libraries is still young, digital library collections have been built around the world and are being deployed on numerous public web sites. But what is a digital library, exactly? In many respects the best way to characterize the notion is by extension, in terms of actual examples, rather than by intension as in a conventional definition. In a very real sense, digital libraries are whatever people choose to call by the term “digital library.”

    View record details
  • Compression and full-text indexing for Digital Libraries

    Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1995)

    Conference item
    University of Waikato

    This chapter has demonstrated the feasibility of full-text indexing of large information bases. The use of modern compression techniques means that there is no space penalty: large document databases can be compressed and indexed in less than a third of the space required by the originals. Surprisingly, there is little or no time penalty either: querying can be faster because less information needs to be read from disk. Simple queries can be answered in a second; more complex ones with more query terms may take a few seconds. One important application is the creation of static databases on CD-ROM, and a 1.5 gigabyte document database can be compressed onto a standard 660 megabyte CD-ROM. Creating a compressed and indexed document database containing hundreds of thousands of documents and gigabytes of data takes a few hours. Whereas retrieval can be done on ordinary workstations, creation requires a machine with a fair amount of main memory.

    View record details
  • Detecting sequential structure

    Nevill-Manning, Craig G.; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

    View record details
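
    A note on the record above: SEQUITUR itself works incrementally, enforcing digram-uniqueness and rule-utility constraints as each symbol arrives. The toy sketch below is not that algorithm but a simplified offline relative that repeatedly replaces the most frequent adjacent pair with a new rule; it is only meant to show how a hierarchical description emerges from repetition in a sequence.

        from collections import Counter

        def infer_grammar(sequence):
            # Offline digram replacement: rewrite the most frequent adjacent
            # pair as a new rule until every pair occurs only once.
            # (A simplified, non-incremental illustration, not SEQUITUR itself.)
            rules = {}              # rule name -> the pair it expands to
            seq = list(sequence)
            next_id = 0
            while True:
                pairs = Counter(zip(seq, seq[1:]))
                if not pairs:
                    break
                pair, count = max(pairs.items(), key=lambda kv: kv[1])
                if count < 2:
                    break
                rule = f"R{next_id}"
                next_id += 1
                rules[rule] = pair
                out, i = [], 0      # rewrite non-overlapping occurrences
                while i < len(seq):
                    if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                        out.append(rule)
                        i += 2
                    else:
                        out.append(seq[i])
                        i += 1
                seq = out
            return seq, rules

        top, rules = infer_grammar("abcabcabcd")
        print(top)    # ['R2', 'R1', 'd']
        print(rules)  # {'R0': ('a', 'b'), 'R1': ('R0', 'c'), 'R2': ('R1', 'R1')}
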
  • Learning to describe data in actions

    Maulsby, David; Witten, Ian H. (1995)

    Conference item
    University of Waikato

    Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

    View record details
  • Compressing semi-structured text using hierarchical phrase identification

    Nevill-Manning, Craig G.; Witten, Ian H.; Olsen, Dan R., Jr. (1996)

    Conference item
    University of Waikato

    Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

    View record details
  • Domain-specific keyphrase extraction

    Frank, Eibe; Paynter, Gordon W.; Witten, Ian H.; Gutwin, Carl; Nevill-Manning, Craig G. (1999)

    Conference item
    University of Waikato

    Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.

    View record details
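
    The procedure summarized above is the KEA approach. The sketch below is a loose illustration under several assumptions rather than the authors' implementation: candidate extraction is reduced to crude word n-grams (KEA stems, filters stopwords and respects phrase boundaries), the two features are taken to be TF×IDF and relative position of first occurrence as in the KEA papers, and the probability tables p_tfidf_bin and p_first_bin, the bin boundaries and the prior are hypothetical stand-ins for values learned from training documents with author-assigned keyphrases.

        import math
        import re
        from collections import Counter

        def kea_style_features(text, doc_freq, num_docs, max_len=3):
            # Two KEA-style features per candidate phrase (a word n-gram):
            # TF*IDF and the relative position of its first occurrence.
            words = re.findall(r"[a-z]+", text.lower())
            n_words = max(len(words), 1)
            counts, first = Counter(), {}
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    phrase = " ".join(words[i:i + n])
                    counts[phrase] += 1
                    first.setdefault(phrase, i)
            feats = {}
            for phrase, tf in counts.items():
                idf = math.log2((num_docs + 1) / (doc_freq.get(phrase, 0) + 1))
                feats[phrase] = ((tf / n_words) * idf, first[phrase] / n_words)
            return feats

        def naive_bayes_rank(feats, p_tfidf_bin, p_first_bin, prior_key=0.05):
            # Hypothetical naive Bayes scorer: each feature is discretized and
            # the class-conditional bin probabilities (pairs of P(bin|keyphrase)
            # and P(bin|not keyphrase), estimated during training) are multiplied
            # under the independence assumption.
            scores = {}
            for phrase, (tfidf, pos) in feats.items():
                tb = max(0, min(int(tfidf * 100), 9))
                pb = max(0, min(int(pos * 10), 9))
                p_key = prior_key * p_tfidf_bin[tb][0] * p_first_bin[pb][0]
                p_not = (1 - prior_key) * p_tfidf_bin[tb][1] * p_first_bin[pb][1]
                scores[phrase] = p_key / (p_key + p_not)
            return sorted(scores, key=scores.get, reverse=True)
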
  • Topic indexing with Wikipedia

    Medelyan, Olena; Witten, Ian H.; Milne, David N. (2008)

    Conference item
    University of Waikato

    Wikipedia article names can be utilized as a controlled vocabulary for identifying the main topics in a document. Wikipedia’s 2M articles cover the terminology of nearly any document collection, which permits controlled indexing in the absence of manually created vocabularies. We combine state-of-the-art strategies for automatic controlled indexing with Wikipedia’s unique property: a richly hyperlinked encyclopedia. We evaluated the scheme by comparing automatically assigned topics with those chosen manually by human indexers. Analysis of indexing consistency shows that our algorithm outperforms some human subjects.

    View record details
  • Measuring inter-indexer consistency using a thesaurus

    Medelyan, Olena; Witten, Ian H. (2006)

    Conference item
    University of Waikato

    When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. We propose to represent multiple indexers' data in a vector space and use the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. We believe that this new measure is more accurate and realistic than existing ones and therefore more suitable for evaluation of automatically extracted index terms.

    View record details
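
    A minimal sketch of the measure proposed above, assuming plain binary vectors over the union of assigned terms (the extension that weights semantically related thesaurus terms is not shown, and the example term sets are hypothetical):

        import math

        def cosine_consistency(terms_a, terms_b, vocabulary):
            # Each indexer's term set becomes a binary vector over the shared
            # vocabulary; consistency is the cosine of the angle between them.
            a = [1.0 if t in terms_a else 0.0 for t in vocabulary]
            b = [1.0 if t in terms_b else 0.0 for t in vocabulary]
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        indexer1 = {"irrigation", "water management", "soils"}   # hypothetical term sets
        indexer2 = {"irrigation", "soil moisture", "soils"}
        vocab = sorted(indexer1 | indexer2)
        print(round(cosine_consistency(indexer1, indexer2, vocab), 3))   # 0.667
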
  • How to turn the page

    Chu, Yi-Chun; Witten, Ian H.; Lobb, Richard; Bainbridge, David (2003)

    Conference item
    University of Waikato

    Can digital libraries provide a reading experience that more closely resembles a real book than a scrolled or paginated electronic display? This paper describes a prototype page-turning system that realistically animates full three-dimensional page-turns. The dynamic behavior is generated by a mass-spring model defined on a rectangular grid of particles. The prototype takes a PDF or E-book file, renders it into a sequence of PNG images representing individual pages, and animates the page-turns under user control. The simulation behaves fairly naturally, although more computer graphics work is required to perfect it.

    View record details
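
    The dynamic behavior in the record above comes from a mass-spring model on a grid of particles. The sketch below is a deliberately small 2D illustration of that idea, not the authors' simulator: the grid size, spring constant, damping, time step and pinned "spine" column are illustrative guesses, and the real system is three-dimensional and tuned for realistic page-turn motion.

        import math

        ROWS, COLS = 4, 6            # grid of particles representing the page
        REST, K, DAMP, DT = 1.0, 40.0, 0.8, 0.01
        GRAVITY = (0.0, -9.8)        # unit particle mass assumed

        pos = [[(c * REST, r * REST) for c in range(COLS)] for r in range(ROWS)]
        vel = [[(0.0, 0.0)] * COLS for _ in range(ROWS)]

        def spring_force(p, q):
            # Hooke's-law force on particle p from the spring connecting it to q.
            dx, dy = q[0] - p[0], q[1] - p[1]
            length = math.hypot(dx, dy) or 1e-9
            magnitude = K * (length - REST)
            return (magnitude * dx / length, magnitude * dy / length)

        def step():
            # One explicit Euler step with per-step velocity damping.
            for r in range(ROWS):
                for c in range(COLS):
                    if c == 0:       # the page's spine column stays pinned
                        continue
                    fx, fy = GRAVITY
                    for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < ROWS and 0 <= cc < COLS:
                            sx, sy = spring_force(pos[r][c], pos[rr][cc])
                            fx, fy = fx + sx, fy + sy
                    vx, vy = vel[r][c]
                    vx, vy = (vx + fx * DT) * DAMP, (vy + fy * DT) * DAMP
                    vel[r][c] = (vx, vy)
                    pos[r][c] = (pos[r][c][0] + vx * DT, pos[r][c][1] + vy * DT)

        for _ in range(200):
            step()
        print(pos[ROWS - 1][COLS - 1])   # the free corner has sagged under gravity
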
  • Mining Domain-Specific Thesauri from Wikipedia: A case study

    Milne, David N.; Medelyan, Olena; Witten, Ian H. (2006)

    Conference item
    University of Waikato

    Domain-specific thesauri are high-cost, high-maintenance, high-value knowledge structures. We show how the classic thesaurus structure of terms and links can be mined automatically from Wikipedia. In a comparison with a professional thesaurus for agriculture we find that Wikipedia contains a substantial proportion of its concepts and semantic relations; furthermore it has impressive coverage of contemporary documents in the domain. Thesauri derived using our techniques capitalize on existing public efforts and tend to reflect contemporary language usage better than their costly, painstakingly-constructed manual counterparts.

    View record details
  • Second language learning in the context of MOOCs

    Wu, Shaoqun; Fitzgerald, Alannah; Witten, Ian H. (2014)

    Conference item
    University of Waikato

    Massive Open Online Courses are becoming popular educational vehicles through which universities reach out to non-traditional audiences. Many enrolees hail from other countries and cultures, and struggle to cope with the English language in which these courses are invariably offered. Moreover, most such learners have a strong desire and motivation to extend their knowledge of academic English, particularly in the specific area addressed by the course. Online courses provide a compelling opportunity for domain-specific language learning. They supply a large corpus of interesting linguistic material relevant to a particular area, including supplementary images (slides), audio and video. We contend that this corpus can be automatically analysed, enriched, and transformed into a resource that learners can browse and query in order to extend their ability to understand the language used, and help them express themselves more fluently and eloquently in that domain. To illustrate this idea, an existing online corpus-based language learning tool (FLAX) is applied to a Coursera MOOC entitled Virology 1: How Viruses Work, offered by Columbia University.

    View record details
  • Privacy preserving computation by fragmenting individual bits and distributing gates

    Will, Mark A.; Ko, Ryan K.L.; Witten, Ian H. (2016)

    Conference item
    University of Waikato

    Solutions that allow the computation of arbitrary operations over data securely in the cloud are currently impractical. The holy grail of cryptography, fully homomorphic encryption, still requires minutes to compute a single operation. In order to provide a practical solution, this paper proposes taking a different approach to the problem of securely processing data. FRagmenting Individual Bits (FRIBs), a scheme which preserves user privacy by distributing bit fragments across many locations, is presented. Privacy is maintained because each server receives only a small portion of the actual data, and solving for the rest leaves a vast number of possibilities. Functions are defined with NAND logic gates, and are computed quickly because the performance overhead is shifted from computation to network latency. This paper details our proof-of-concept addition algorithm, which took 346ms to add two 32-bit values, paving the way towards further improvements that bring computation times under 100ms.

    View record details
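
    The core idea in the record above is that a user's data never leaves their control in usable form: each bit is split into fragments and the fragments are sent to different servers. The toy sketch below shows only that fragmentation step, assuming a simple XOR-based split; the splitting method is an illustrative assumption rather than the published scheme, and the paper's distributed evaluation of NAND-gate circuits over the fragments is not sketched.

        import secrets

        def fragment_bit(bit, num_servers):
            # Split one bit into num_servers fragments that XOR back to the bit.
            # Any subset of fewer than all fragments looks uniformly random.
            fragments = [secrets.randbits(1) for _ in range(num_servers - 1)]
            last = bit
            for f in fragments:
                last ^= f
            return fragments + [last]

        def reconstruct(fragments):
            bit = 0
            for f in fragments:
                bit ^= f
            return bit

        shares = fragment_bit(1, 4)
        print(shares, "->", reconstruct(shares))   # e.g. [0, 1, 1, 1] -> 1
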
  • An effective, low-cost measure of semantic relatedness obtained from Wikipedia links

    Witten, Ian H.; Milne, David N. (2008)

    Conference item
    University of Waikato

    This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter.

    View record details
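
    A sketch of the link-based idea in the record above: two concepts are compared through the sets of Wikipedia articles that link to them, using a Normalized Google Distance style formula. The link sets in the example are hypothetical, and the published measure also draws on outgoing links, so this is an approximation of the approach rather than a faithful reimplementation.

        import math

        def link_relatedness(links_a, links_b, total_articles):
            # Relatedness of two concepts from the sets of articles that link
            # to them: 0 when the sets are disjoint, approaching 1 as they overlap.
            a, b = set(links_a), set(links_b)
            common = a & b
            if not common:
                return 0.0
            distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / (
                math.log(total_articles) - math.log(min(len(a), len(b))))
            return max(0.0, 1.0 - distance)

        # Hypothetical incoming-link sets for two closely related articles
        car = {"Vehicle", "Engine", "Road", "Transport", "Tyre"}
        automobile = {"Vehicle", "Engine", "Transport", "Factory"}
        print(round(link_relatedness(car, automobile, 2_000_000), 3))
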
  • A knowledge-based search engine powered by Wikipedia

    Milne, David N.; Witten, Ian H.; Nichols, David M. (2007)

    Conference item
    University of Waikato

    This paper describes Koru, a new search interface that offers effective domain-independent knowledge-based information retrieval. Koru exhibits an understanding of the topics of both queries and documents. This allows it to (a) expand queries automatically and (b) help guide the user as they evolve their queries interactively. Its understanding is mined from the vast investment of manual effort and judgment that is Wikipedia. We show how this open, constantly evolving encyclopedia can yield inexpensive knowledge structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We conducted a detailed user study with 12 participants and 10 topics from the 2005 TREC HARD track, and found that Koru and its underlying knowledge base offer significant advantages over traditional keyword search. It was capable of lending assistance to almost every query issued to it, making query entry more efficient, improving the relevance of the documents returned, and narrowing the gap between expert and novice seekers.

    View record details
  • Human-competitive tagging using automatic keyphrase extraction

    Medelyan, Olena; Frank, Eibe; Witten, Ian H. (2009)

    Conference item
    University of Waikato

    This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, “Maui”, that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

    View record details
  • One-Class Classification by Combining Density and Class Probability Estimation

    Hempstalk, Kathryn; Frank, Eibe; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    One-class classification has important applications such as outlier and novelty detection. It is commonly tackled using density estimation techniques or by adapting a standard classification algorithm to the problem of carving out a decision boundary that describes the location of the target data. In this paper we investigate a simple method for one-class classification that combines the application of a density estimator, used to form a reference distribution, with the induction of a standard model for class probability estimation. In this method, the reference distribution is used to generate artificial data that is employed to form a second, artificial class. In conjunction with the target class, this artificial class is the basis for a standard two-class learning problem. We explain how the density function of the reference distribution can be combined with the class probability estimates obtained in this way to form an adjusted estimate of the density function of the target class. Using UCI datasets, and data from a typist recognition problem, we show that the combined model, consisting of both a density estimator and a class probability estimator, can improve on using either component technique alone when used for one-class classification. We also compare the method to one-class classification using support vector machines.

    View record details
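
    The combination described in the record above can be made concrete with Bayes' rule: if the two-class problem mixes target data T (in proportion P(T)) with artificial data drawn from the reference distribution A, then P(x|T) = ((1 - P(T)) / P(T)) * (P(T|x) / (1 - P(T|x))) * P(x|A). The sketch below applies that formula with a single Gaussian as the reference density and logistic regression as the class probability estimator; these choices, the synthetic data, and the 50/50 class proportion are illustrative assumptions, not the paper's experimental setup.

        import numpy as np
        from scipy.stats import multivariate_normal
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        target = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(200, 2))   # target class T

        # 1. Fit a reference distribution A to the target data.
        mean, cov = target.mean(axis=0), np.cov(target, rowvar=False)
        reference = multivariate_normal(mean, cov)

        # 2. Generate artificial data from A and train a two-class probability estimator.
        artificial = reference.rvs(size=200, random_state=0)
        X = np.vstack([target, artificial])
        y = np.r_[np.ones(len(target)), np.zeros(len(artificial))]
        clf = LogisticRegression().fit(X, y)
        p_t = 0.5                          # proportion of target data in the mix

        def target_density(x):
            # Adjusted density of the target class:
            # P(x|T) = ((1-P(T))/P(T)) * (P(T|x)/(1-P(T|x))) * P(x|A)
            x = np.atleast_2d(x)
            p_t_given_x = clf.predict_proba(x)[:, 1]
            odds = p_t_given_x / np.clip(1.0 - p_t_given_x, 1e-12, None)
            return ((1 - p_t) / p_t) * odds * reference.pdf(x)

        print(target_density([2.0, 2.0]), target_density([5.0, 5.0]))   # high vs low
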
  • Learning to link with Wikipedia

    Milne, David N.; Witten, Ian H. (2008)

    Conference item
    University of Waikato

    View record details
  • Thesaurus based automatic keyphrase indexing

    Medelyan, Olena; Witten, Ian H. (2006)

    Conference item
    University of Waikato

    We propose a new method that enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus. We evaluate the results against keyphrase sets assigned by a state-of-the-art keyphrase extraction system and those assigned by six professional indexers.

    View record details
  • Determining progression in glaucoma using visual fields

    Turpin, Andrew; Frank, Eibe; Hall, Mark A.; Witten, Ian H.; Johnson, Chris A. (2001)

    Conference item
    University of Waikato

    The standardized visual field assessment, which measures visual function in 76 locations of the central visual area, is an important diagnostic tool in the treatment of the eye disease glaucoma. It helps determine whether the disease is stable or progressing towards blindness, with important implications for treatment. Automatic techniques to classify patients based on this assessment have had limited success, primarily due to the high variability of individual visual field measurements. The purpose of this paper is to describe the problem of visual field classification to the data mining community, and assess the success of data mining techniques on it. Preliminary results show that machine learning methods rival existing techniques for predicting whether glaucoma is progressing—though we have not yet been able to demonstrate improvements that are statistically significant. It is likely that further improvement is possible, and we encourage others to work on this important practical data mining problem.

    View record details