5,589 results for Conference item

  • Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation

    Hinze, Annika; Taube-Schock, Craig; Bainbridge, David; Matamua, Rangi; Downie, J. Stephen (2015)

    Conference item
    University of Waikato

    With 13,000,000 volumes comprising 4.5 billion pages of text, it is currently very difficult for scholars to locate relevant sets of documents that are useful in their research from the HathiTrust Digital Library (HTDL) using traditional lexically-based retrieval techniques. Existing document search tools and document clustering approaches use purely lexical analysis, which cannot address the inherent ambiguity of natural language. A semantic search approach offers the potential to overcome the shortcomings of lexical search, but even if an appropriate network of ontologies could be decided upon it would require a full semantic markup of each document. In this paper, we present a conceptual design and report on the initial implementation of a new framework that affords the benefits of semantic search while minimizing the problems associated with applying existing semantic analysis at scale. Our approach avoids the need for complete semantic document markup using pre-existing ontologies by developing an automatically generated Concept-in-Context (CiC) network seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system analyzes documents by the semantics and context of their content. The disambiguation of search queries is done interactively, to fully utilize the domain knowledge of the scholar. Our method achieves a form of semantic-enhanced search that simultaneously exploits the proven scale benefits provided by lexical indexing.

    View record details
  • The impact of slow steaming on refrigerated exports from New Zealand

    Carson, James K.; Kemp, R.M.; East, A.R.; Cleland, D.J. (2015)

    Conference item
    University of Waikato

    The practice of slow steaming has had a significant impact on New Zealand export industries, with increased transit times in some cases causing significant reductions in shelf life once the product has reached the retail stage. The longer transit times also impose the extra cost to exporters of having more inventory tied up in transit. While there is clear evidence to suggest slow steaming has reduced fuel consumption and hence fuel emissions and fuel costs, these savings have not been passed on by the liners to their customers. However, there is no indication that slow steaming has caused a significant reduction in export earnings for New Zealand (at least up to the middle of 2014). A predicted move to super-slow steaming would put extra strain on the New Zealand meat industry especially, with their lucrative European chilled lamb market under particular threat.

    View record details
  • Progger: an efficient, tamper-evident kernel-space logger for cloud data provenance tracking

    Ko, Ryan K.L.; Will, Mark A. (2014)

    Conference item
    University of Waikato

    Cloud data provenance, or "what has happened to my data in the cloud", is a critical data security component which addresses pressing data accountability and data governance issues in cloud computing systems. In this paper, we present Progger (Provenance Logger), a kernel-space logger which potentially empowers all cloud stakeholders to trace their data. Logging from the kernel space empowers security analysts to collect provenance from the lowest possible atomic data actions, and enables several higher-level tools to be built for effective end-to-end tracking of data provenance. Within the last few years, an increasing number of kernel-space provenance tools have been proposed, but they face several critical data security and integrity problems. These prior tools' limitations include (1) the inability to provide log tamper-evidence and to prevent fake/manual entries, (2) inaccurate and coarse-grained timestamp synchronisation across several machines, (3) excessive log space requirements and growth, and (4) inefficient logging of root usage of the system. Progger resolves all these critical issues and, as such, provides high assurance of data security and data activity audit. With this in mind, the paper will discuss these elements of high-assurance cloud data provenance, describe the design of Progger and its efficiency, and present compelling results which pave the way for Progger to become a foundation tool for data activity tracking across all cloud systems.

    View record details
  • Browsing a digital library: A new approach for the New Zealand digital library

    McKay, Dana; Cunningham, Sally Jo (2003)

    Conference item
    University of Waikato

    Browsing is part of the information seeking process, used when information needs are ill-defined or unspecific. Browsing and searching are often interleaved during information seeking to accommodate changing awareness of information needs. Digital Libraries often support full-text search, but are not so helpful in supporting browsing. Described here is a novel browsing system created for the Greenstone software used by the New Zealand Digital Library that supports users in a more natural approach to the information seeking process. © Springer-Verlag Berlin Heidelberg 2003.

    View record details
  • Scaling anisotropy of the power in parallel and perpendicular components of the solar wind magnetic field

    Forman, Miriam A.; Wicks, Robert T.; Horbury, Timothy S.; Oughton, Sean (2013)

    Conference item
    University of Waikato

    Power spectra of the components of the magnetic field parallel (Pzz) and perpendicular (Pxx+Pyy) to the local mean magnetic field direction were determined by wavelet methods from Ulysses’ MAG instrument data during eighteen 10-day segments of its first North Polar pass at high latitude at solar minimum in 1995. The power depends on frequency f, on the angle θ between the solar wind direction and the local mean field, and on distance from the Sun. These data include the solar wind whose total power (Pxx + Pyy + Pzz) in magnetic fluctuations we previously reported depends on f and the angle θ nearly as predicted by the GS95 critical balance model of strong incompressible MHD turbulence. Results over a much wider range of frequencies during six evenly-spaced 10-day periods are presented here to illustrate the variability and evolution with distance from the Sun. Here we investigate the anisotropic scaling of Pzz(f,θ) in particular because it is a reduced form of the Poloidal (pseudo-Alfvenic) component of the (incompressible) fluctuations. We also report the much larger Pxx(f,θ)+Pyy(f,θ), which is (mostly) reduced from the Toroidal (Alfvenic, i.e., perpendicular to both B and k) fluctuations and comprises most of the total power. These different components of the total power evolve and scale differently in the inertial range. We compare these elements of the magnetic power spectral tensor with “critical balance” model predictions.
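
    The GS95 critical-balance predictions compared against in this abstract can be stated compactly; these are the standard forms from the turbulence literature, not equations quoted from this paper:

```latex
% Critical balance: Alfv\'en and nonlinear timescales comparable
k_{\parallel} v_A \sim k_{\perp}\, \delta v_{k_\perp}
% which, with a Kolmogorov-like perpendicular cascade, gives the anisotropy
k_{\parallel} \propto k_{\perp}^{2/3}
% and the two reduced power-law predictions for the measured spectra:
P(f,\,\theta = 90^{\circ}) \propto f^{-5/3}, \qquad
P(f,\,\theta = 0^{\circ}) \propto f^{-2}
```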

    View record details
  • Compositional synthesis of discrete event systems using synthesis abstraction

    Mohajerani, Sahar; Malik, Robi; Ware, Simon; Fabian, Martin (2011)

    Conference item
    University of Waikato

    This paper proposes a general method to synthesize a least restrictive supervisor for a large discrete event system model, consisting of a large number of arbitrary automata representing the plants and specifications. A new type of abstraction called synthesis abstraction is introduced and three rules are proposed to calculate an abstraction of a given automaton. Furthermore, a compositional algorithm for synthesizing a supervisor for large-scale systems of composed finite-state automata is proposed. In the proposed algorithm, the synchronous composition is computed step by step and intermediate results are simplified according to synthesis abstraction. Then a supervisor for the abstracted system is calculated, which in combination with the original system gives the least restrictive, nonblocking, and controllable behaviour.
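
    The synchronous composition computed step by step in the proposed algorithm can be sketched for two automata. The dict-based encoding, event names, and the machine/buffer example below are illustrative assumptions, not the paper's notation:

```python
# Minimal sketch of synchronous composition of two finite-state automata:
# shared events synchronise, private events interleave.

def synchronous_composition(a, b):
    """Compose automata a and b.

    Each automaton is a dict with keys:
      'init'   : initial state,
      'events' : set of event labels (its alphabet),
      'trans'  : dict mapping (state, event) -> next state.
    """
    shared = a['events'] & b['events']
    init = (a['init'], b['init'])
    trans = {}
    seen = {init}
    frontier = [init]
    while frontier:
        sa, sb = frontier.pop()
        for ev in a['events'] | b['events']:
            if ev in shared:
                ta = a['trans'].get((sa, ev))
                tb = b['trans'].get((sb, ev))
                if ta is None or tb is None:
                    continue  # a shared event is blocked unless both enable it
                nxt = (ta, tb)
            elif ev in a['events']:
                ta = a['trans'].get((sa, ev))
                if ta is None:
                    continue
                nxt = (ta, sb)  # private event of a: b stays put
            else:
                tb = b['trans'].get((sb, ev))
                if tb is None:
                    continue
                nxt = (sa, tb)  # private event of b: a stays put
            trans[((sa, sb), ev)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {'init': init, 'states': seen, 'trans': trans}

# Tiny example: a machine and a one-slot buffer synchronising on 'put'.
machine = {'init': 'idle', 'events': {'start', 'put'},
           'trans': {('idle', 'start'): 'busy', ('busy', 'put'): 'idle'}}
buffer = {'init': 'empty', 'events': {'put', 'get'},
          'trans': {('empty', 'put'): 'full', ('full', 'get'): 'empty'}}
composed = synchronous_composition(machine, buffer)
```

    Computing this product incrementally and simplifying intermediate results is what keeps the compositional approach tractable.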

    View record details
  • Constructing a focused taxonomy from a document collection

    Medelyan, Olena; Manion, Steve; Broekstra, Jeen; Divoli, Anna; Huang, Anna-Lan; Witten, Ian H. (2013)

    Conference item
    University of Waikato

    We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.

    View record details
  • Reverse greed in energy and transport

    Kingham, S.; Muir, S. (2016)

    Conference item
    University of Canterbury Library

    View record details
  • Predicting polycyclic aromatic hydrocarbon concentrations in soil and water samples

    Holmes, Geoffrey; Fletcher, Dale; Reutemann, Peter (2010)

    Conference item
    University of Waikato

    Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easy-to-use graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and for picking the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and root-mean-squared error, are demonstrated.

    View record details
  • Yet another approach to compositional synthesis of discrete event systems

    Malik, Robi; Flordal, Hugo (2008)

    Conference item
    University of Waikato

    A two-pass algorithm for compositional synthesis of modular supervisors for large-scale systems of composed finite-state automata is proposed. The first pass provides an efficient method to determine whether a supervisory control problem has a solution, without explicitly constructing the synchronous composition of all components. If a solution exists, the second pass yields an over-approximation of the least restrictive solution which, if nonblocking, is a modular representation of the least restrictive supervisor. Using a new type of equivalence of nondeterministic processes, called synthesis equivalence, a wide range of abstractions can be employed to mitigate state-space explosion throughout the algorithm.

    View record details
  • New ensemble methods for evolving data streams

    Bifet, Albert; Holmes, Geoffrey; Pfahringer, Bernhard; Kirkby, Richard Brendon; Gavaldà, Ricard (2009)

    Conference item
    University of Waikato

    Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is when concepts drift or change completely, is becoming one of the core issues. When tackling non-stationary concepts, ensembles of classifiers have several advantages over single classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and real-world datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.
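
    The Bagging variants above build on Oza-style online bagging, in which each ensemble member trains on each incoming instance a Poisson(1)-distributed number of times; ADWIN Bagging additionally resets a member when its change detector fires. A minimal sketch of that online-bagging core, with a toy majority-class base learner standing in for Hoeffding trees (all names and the demo stream are illustrative):

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Draw from a Poisson distribution via Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

class MajorityClassLearner:
    """Toy base learner: always predicts the most frequent class seen."""
    def __init__(self):
        self.counts = Counter()
    def train(self, x, y):
        self.counts[y] += 1
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

class OnlineBagging:
    def __init__(self, n_members=10, seed=1):
        self.rng = random.Random(seed)
        self.members = [MajorityClassLearner() for _ in range(n_members)]
    def train(self, x, y):
        for m in self.members:
            # Each member sees the instance Poisson(1) times, simulating
            # the with-replacement resampling of batch bagging.
            for _ in range(poisson(1.0, self.rng)):
                m.train(x, y)
    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.members)
        return votes.most_common(1)[0][0]

ensemble = OnlineBagging()
for label in ['a', 'a', 'b', 'a']:
    ensemble.train(None, label)
```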

    View record details
  • Fast perceptron decision tree learning from evolving data streams

    Bifet, Albert; Holmes, Geoffrey; Pfahringer, Bernhard; Frank, Eibe (2010)

    Conference item
    University of Waikato

    Mining of data streams must balance three evaluation dimensions: accuracy, time and memory. Excellent accuracy on data streams has been obtained with Naive Bayes Hoeffding Trees—Hoeffding Trees with naive Bayes models at the leaf nodes—albeit with increased runtime compared to standard Hoeffding Trees. In this paper, we show that runtime can be reduced by replacing naive Bayes with perceptron classifiers, while maintaining highly competitive accuracy. We also show that accuracy can be increased even further by combining majority vote, naive Bayes, and perceptrons. We evaluate four perceptron-based learning strategies and compare them against appropriate baselines: simple perceptrons, Perceptron Hoeffding Trees, hybrid Naive Bayes Perceptron Trees, and bagged versions thereof. We implement a perceptron that uses the sigmoid activation function instead of the threshold activation function and optimizes the squared error, with one perceptron per class value. We test our methods by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
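
    A minimal sketch of the perceptron variant described above: sigmoid activation, gradient updates on the squared error, one perceptron per class value. The learning rate, iteration count, and toy one-feature dataset are illustrative assumptions, not the paper's setup:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SigmoidPerceptron:
    """One weight vector (plus bias) per class value; predict the class
    whose perceptron produces the highest output."""
    def __init__(self, n_features, classes, lr=0.5):
        self.lr = lr
        self.w = {c: [0.0] * (n_features + 1) for c in classes}  # last entry is bias

    def _output(self, w, x):
        z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
        return sigmoid(z)

    def train(self, x, y):
        for c, w in self.w.items():
            target = 1.0 if c == y else 0.0
            out = self._output(w, x)
            # Gradient of the squared error through the sigmoid:
            # delta = (target - out) * out * (1 - out)
            delta = (target - out) * out * (1.0 - out)
            for i, xi in enumerate(x):
                w[i] += self.lr * delta * xi
            w[-1] += self.lr * delta

    def predict(self, x):
        return max(self.w, key=lambda c: self._output(self.w[c], x))

# Stream a linearly separable toy problem past the model.
model = SigmoidPerceptron(n_features=1, classes=['neg', 'pos'])
for _ in range(500):
    model.train([0.0], 'neg')
    model.train([1.0], 'pos')
```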

    View record details
  • Greenstone digital library software: current research

    Bainbridge, David; Witten, Ian H. (2004)

    Conference item
    University of Waikato

    The Greenstone digital library software (www.greenstone.org) provides a flexible way of organizing information and publishing it on the Internet or removable media such as CD-ROM. Its aim is to empower users, particularly in universities, libraries and other public service institutions, to build their own digital libraries. It is open-source software, issued under the terms of the GNU General Public License. It is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.

    View record details
  • Handling numeric attributes in Hoeffding trees

    Pfahringer, Bernhard; Holmes, Geoffrey; Kirkby, Richard Brendon (2008)

    Conference item
    University of Waikato

    For conventional machine learning classification algorithms, handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into pre-defined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carry over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approximation approach, based on simple Gaussian approximation. We then compare this method with four approaches from the literature arriving at eight final algorithm configurations for testing. The solutions cover a range of options from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.
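
    The simple Gaussian approximation mentioned above can be sketched as one incrementally updated estimator per (class, numeric attribute) pair, so candidate split points can be scored without storing the stream. Welford's online update is a standard way to do this; the class labels and values below are illustrative:

```python
import math

class GaussianEstimator:
    """Incremental mean/variance of a numeric attribute (Welford's method)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, value):
        """Gaussian density at `value`, used to score candidate split points."""
        var = self.variance()
        if var == 0.0:
            return 1.0 if value == self.mean else 0.0
        return math.exp(-(value - self.mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# One estimator per class for a single numeric attribute.
stats = {'spam': GaussianEstimator(), 'ham': GaussianEstimator()}
for label, value in [('spam', 9.0), ('spam', 11.0), ('ham', 1.0), ('ham', 3.0)]:
    stats[label].add(value)
```

    Memory stays constant per (class, attribute) pair regardless of stream length, which is the property that makes this approximation attractive at a Hoeffding-tree leaf.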

    View record details
  • Mining Arbitrarily Large Datasets Using Heuristic k-Nearest Neighbour Search

    Wu, Xing; Holmes, Geoffrey; Pfahringer, Bernhard (2008)

    Conference item
    University of Waikato

    Nearest Neighbour Search (NNS) is one of the top ten data mining algorithms. It is simple and effective but has a time complexity that is the product of the number of instances and the number of dimensions. When the number of dimensions is greater than two there are no known solutions that can guarantee a sublinear retrieval time. This paper describes and evaluates two ways to make NNS efficient for datasets that are arbitrarily large in the number of instances and dimensions. The methods are best described as heuristic as they are neither exact nor approximate. Both stem from recent developments in the field of data stream classification. The first uses Hoeffding Trees, an extension of decision trees to streams and the second is a direct stream extension of NNS. The methods are evaluated in terms of their accuracy and the time taken to find the neighbours. Results show that the methods are competitive with NNS in terms of accuracy but significantly faster.
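
    As a rough illustration of a stream-friendly nearest-neighbour classifier (a simple windowed variant, not the paper's Hoeffding-tree method), a bounded window of recent instances keeps both memory and lookup cost fixed as the stream grows. Window size, k, and Euclidean distance are illustrative choices:

```python
import math
from collections import deque

class WindowedKNN:
    """k-NN over a sliding window: old instances fall off the back."""
    def __init__(self, k=3, window=100):
        self.k = k
        self.window = deque(maxlen=window)

    def train(self, x, y):
        self.window.append((x, y))

    def predict(self, x):
        # Majority vote among the k stored instances closest to x.
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[:self.k]
        votes = {}
        for _, label in nearest:
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)

knn = WindowedKNN(k=3, window=5)
for x, y in [([0, 0], 'a'), ([0, 1], 'a'), ([1, 0], 'a'), ([5, 5], 'b'), ([5, 6], 'b')]:
    knn.train(x, y)
```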

    View record details
  • An effective, low-cost measure of semantic relatedness obtained from Wikipedia links

    Witten, Ian H.; Milne, David N. (2008)

    Conference item
    University of Waikato

    This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter.
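
    Link-based relatedness measures of this kind are commonly formulated as a Normalized Google Distance over the sets of articles linking in to each concept. The sketch below uses that form with a fabricated toy link table, so the article names and counts are illustrative only:

```python
import math

def relatedness(inlinks_a, inlinks_b, total_articles):
    """Semantic relatedness of two articles from their incoming-link sets,
    via a Normalized Google Distance over shared inlinks (1 = identical)."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0
    distance = ((math.log(max(len(a), len(b))) - math.log(len(common))) /
                (math.log(total_articles) - math.log(min(len(a), len(b)))))
    return max(0.0, 1.0 - distance)

# Toy corpus: which articles link in to each concept (fabricated).
links_to_car = {'engine', 'road', 'wheel', 'driver'}
links_to_truck = {'engine', 'road', 'wheel', 'cargo'}
links_to_banana = {'fruit', 'yellow'}

score_car_truck = relatedness(links_to_car, links_to_truck, total_articles=1000)
score_car_banana = relatedness(links_to_car, links_to_banana, total_articles=1000)
```

    Only link sets and set intersections are needed, which is what makes the measure cheap compared with analysing full article text.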

    View record details
  • Leveraging bagging for evolving data streams

    Bifet, Albert; Holmes, Geoffrey; Pfahringer, Bernhard (2010)

    Conference item
    University of Waikato

    Bagging, boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They obtain superior performance by increasing the accuracy and diversity of the single classifiers. Attempts have been made to reproduce these methods in the more challenging context of evolving data streams. In this paper, we propose a new variant of bagging, called leveraging bagging. This method combines the simplicity of bagging with adding more randomization to the input, and output of the classifiers. We test our method by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
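
    The added input randomization in leveraging bagging is commonly described as drawing each instance's training weight from Poisson(λ) with λ > 1, versus Poisson(1) in plain online bagging, so members see more and more diverse copies of the data. A minimal sketch of just that step; λ = 6 and the sampling demo are illustrative assumptions:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's method; adequate for small lambda."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(42)
n = 10000
# Plain online bagging weight: each member trains Poisson(1) times per instance.
plain = sum(poisson(1.0, rng) for _ in range(n)) / n
# Leveraging bagging weight: Poisson(lambda) with lambda > 1, e.g. 6.
leveraged = sum(poisson(6.0, rng) for _ in range(n)) / n
```

    The abstract also mentions randomizing the classifiers' output; that side of the method is omitted here.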

    View record details
  • Mining Domain-Specific Thesauri from Wikipedia: A case study

    Milne, David N.; Medelyan, Olena; Witten, Ian H. (2006)

    Conference item
    University of Waikato

    Domain-specific thesauri are high-cost, high-maintenance, high-value knowledge structures. We show how the classic thesaurus structure of terms and links can be mined automatically from Wikipedia. In a comparison with a professional thesaurus for agriculture we find that Wikipedia contains a substantial proportion of its concepts and semantic relations; furthermore it has impressive coverage of contemporary documents in the domain. Thesauri derived using our techniques capitalize on existing public efforts and tend to reflect contemporary language usage better than their costly, painstakingly-constructed manual counterparts.

    View record details
  • A Comparison of the BTT and TTF Test-Generation Methods

    Legeard, Bruno; Peureux, Fabien; Utting, Mark (2002)

    Conference item
    University of Waikato

    This paper compares two methods of generating tests from formal specifications. The Test Template Framework (TTF) method is a framework and set of heuristics for manually generating test sets from a Z specification. The B Testing Tools (BTT) method uses constraint logic programming techniques to generate test sequences from a B specification. We give a concise description of each method, then compare them on an industrial case study, which is a subset of the GSM 11.11 smart card specification.

    View record details
  • Stress-testing Hoeffding trees

    Holmes, Geoffrey; Kirkby, Richard Brendon; Pfahringer, Bernhard (2005)

    Conference item
    University of Waikato

    Hoeffding trees are state-of-the-art in classification for data streams. They perform prediction by choosing the majority class at each leaf. Their predictive accuracy can be increased by adding Naive Bayes models at the leaves of the trees. By stress-testing these two prediction methods using noise and more complex concepts and an order of magnitude more instances than in previous studies, we discover situations where the Naive Bayes method outperforms the standard Hoeffding tree initially but is eventually overtaken. The reason for this crossover is determined and a hybrid adaptive method is proposed that generally outperforms the two original prediction methods for both simple and complex concepts as well as under noise.
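
    The hybrid adaptive method described above can be sketched as per-leaf bookkeeping: score both prediction methods on each instance as it arrives, then answer with whichever has been more accurate so far. The callable stand-ins for the majority-class and naive Bayes predictors are illustrative, not the paper's implementation:

```python
class AdaptiveHybridLeaf:
    """Track the observed accuracy of majority-class and naive-Bayes
    prediction at a leaf and use whichever is currently ahead."""
    def __init__(self, majority_predict, nb_predict):
        self.majority_predict = majority_predict
        self.nb_predict = nb_predict
        self.mc_correct = 0
        self.nb_correct = 0

    def observe(self, x, y):
        # Score both methods on the instance before learning from it.
        if self.majority_predict(x) == y:
            self.mc_correct += 1
        if self.nb_predict(x) == y:
            self.nb_correct += 1

    def predict(self, x):
        if self.nb_correct > self.mc_correct:
            return self.nb_predict(x)
        return self.majority_predict(x)

# Demo: naive Bayes is right on these instances more often than majority class.
leaf = AdaptiveHybridLeaf(majority_predict=lambda x: 'a',
                          nb_predict=lambda x: 'b' if x > 0 else 'a')
for x, y in [(1, 'b'), (2, 'b'), (-1, 'a')]:
    leaf.observe(x, y)
```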

    View record details