Cahsai, Atoshum Samuel (2020) Scaling kNN queries using statistical learning. PhD thesis, University of Glasgow.
Abstract
The k-Nearest Neighbour (kNN) method is a fundamental building block for many sophisticated statistical learning models and has wide application in different fields, for instance in kNN regression, kNN classification, multi-dimensional item search, location-based services and spatial analytics.
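For orientation, the two basic operations named above can be sketched as a brute-force baseline. This is only an illustrative sketch, not code from the thesis: the function names `knn_query` and `knn_regress` and the toy data are assumptions, and Euclidean distance is assumed as the metric.

```python
import heapq
import math

def knn_query(points, query, k):
    """Return the k points closest to `query` (exact, brute force, Euclidean)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

def knn_regress(samples, query, k):
    """Predict a target as the mean target of the k nearest (point, target) samples."""
    nearest = heapq.nsmallest(k, samples, key=lambda s: math.dist(s[0], query))
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical toy data for illustration only.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
samples = [((0.0, 0.0), 0.0), ((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((5.0, 5.0), 50.0)]

neighbours = knn_query(points, (0.9, 0.9), k=2)
prediction = knn_regress(samples, (0.9, 0.9), k=2)
```

Every query scans the whole dataset here, which is exactly the cost that indexing and distributed processing aim to avoid.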
However, the unprecedented spread of data generated by computing and communicating devices has resulted in a plethora of low-dimensional large-scale datasets and a growing community of users, so the need for efficient and scalable kNN processing is pressing. To this end, several parallel and distributed approaches and methodologies for processing exact kNN queries over low-dimensional large-scale datasets have been proposed, for example Hadoop-MapReduce-based kNN query processing approaches such as SpatialHadoop (SHadoop), and Spark-based approaches like Simba. This thesis contributes a variety of methodologies for kNN query processing based on statistical and machine learning techniques over large-scale datasets.
This study investigates the exact kNN query performance behaviour of the well-known Big Data systems SHadoop and Simba, which propose building multi-dimensional Global and Local Indexes over low-dimensional large-scale datasets. The rationale behind such methods is that, when executing an exact kNN query, the Global and Local Indexes access only a small subset of a large-scale dataset stored in a distributed file system. The Global Index is used to prune out irrelevant subsets of the dataset, while the multiple distributed Local Indexes are used to prune out unnecessary data elements of a partition (subset).
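The pruning role of a global index can be illustrated with a generic sketch. It assumes that each partition is summarised by an axis-aligned bounding box and that partitions are visited in order of their minimum possible distance to the query. This is a textbook technique and not the actual SHadoop or Simba algorithm; all names (`min_dist_to_box`, `knn_with_global_index`) and the toy partitions are hypothetical, and a plain scan stands in for the local index.

```python
import heapq
import math

def min_dist_to_box(query, box):
    """Smallest Euclidean distance from `query` to an axis-aligned box (lo, hi)."""
    lo, hi = box
    return math.sqrt(sum(max(l - q, 0.0, q - h) ** 2 for q, l, h in zip(query, lo, hi)))

def knn_with_global_index(partitions, query, k):
    """partitions: list of (box, points). Returns (k nearest points, partitions scanned)."""
    # Global index step: order partitions by how close their box can possibly be.
    order = sorted(partitions, key=lambda part: min_dist_to_box(query, part[0]))
    best = []  # max-heap via negated distances: (-distance, point)
    scanned = 0
    for box, points in order:
        # Prune: no point in this box (or any later one) can beat the current k-th best.
        if len(best) == k and min_dist_to_box(query, box) >= -best[0][0]:
            break
        scanned += 1
        for p in points:  # a plain scan stands in for a local index lookup
            d = math.dist(p, query)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    nearest = [p for _, p in sorted(best, key=lambda item: -item[0])]
    return nearest, scanned

# Hypothetical partitions: (bounding box, points stored in that partition).
partitions = [
    (((0.0, 0.0), (1.0, 1.0)), [(0.2, 0.2), (0.8, 0.9)]),
    (((1.0, 0.0), (2.0, 1.0)), [(1.5, 0.5)]),
    (((8.0, 8.0), (9.0, 9.0)), [(8.5, 8.5)]),
]
result, scanned = knn_with_global_index(partitions, (0.1, 0.1), k=2)
```

In this toy run the far-away third partition is never loaded, which is the effect the Global Index is meant to achieve; the points of every partition that is not pruned still have to be read.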
The kNN execution algorithm of SHadoop and Simba involves loading the data elements that reside in the relevant partitions from disks/network points into memory. This leads to significantly high kNN query response times, so such methods are not suitable for low-latency applications and services. An extensive literature review showed that not enough attention has been given to accessing only relatively small-sized but relevant data when processing a kNN query. Based on this limitation, and departing from the traditional kNN query processing methods, this thesis contributes two novel solutions: the Coordinator With Index (COWI) and Coordinator with No Index (CONI) approaches. The essence of both approaches rests on adopting a coordinator-based distributed processing algorithm and a way to structure computation and index the stored datasets that ensures that only a very small number of pieces of data are retrieved from the underlying data centres, communicated over the network, and processed by the coordinator for every kNN query. The expected outcome is that scalability is ensured and kNN queries can be processed in just tens of milliseconds. Both approaches are implemented using a NoSQL database (HBase), achieving up to three orders of magnitude of performance gain compared with the state-of-the-art methods SHadoop and Simba.
It is common practice for current state-of-the-art approaches to exact kNN query processing in low-dimensional space to use Tree-based multi-dimensional Indexing methods to prune out irrelevant data during query processing. However, as data sizes continue to increase (nowadays it is not uncommon to reach several Petabytes), the storage cost of Tree-based Index methods becomes exceptionally high, especially when a dataset is partitioned into smaller chunks. In this context, this thesis contributes a novel perspective on how to organise low-dimensional large-scale datasets based on data space transformations, deriving a Space Transformation Organisation Structure (STOS). STOS facilitates kNN query processing as if the underlying datasets were uniformly distributed in the space. Such an approach bears significant advantages. First, STOS enjoys a minute memory footprint that is many orders of magnitude smaller than that of the Index-based approaches found in the literature. Second, the memory required for such metadata information over large-scale datasets, unlike in related work, increases very slowly with dataset size. Hence, STOS enjoys significantly higher scalability. Third, STOS is relatively efficient to compute, outperforming traditional multivariate Index building times while offering comparable, if not better, query response times.
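The abstract does not specify how STOS transforms the data space. One common way to make skewed data behave as if it were uniformly distributed is to map each coordinate through its empirical marginal cumulative distribution function (the probability integral transform). The sketch below illustrates only that general idea under this assumption; it is not the STOS construction, and `fit_marginal_cdf`, `bucket_counts` and the synthetic data are hypothetical.

```python
import bisect
import random

def fit_marginal_cdf(values):
    """Return a function mapping x to its empirical CDF value in [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n

def bucket_counts(values, buckets, lo, hi):
    """Count how many values fall into each of `buckets` equal-width cells on [lo, hi]."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return counts

random.seed(0)
skewed = [random.expovariate(1.0) for _ in range(10_000)]  # heavily skewed synthetic data
cdf = fit_marginal_cdf(skewed)

# Equal-width cells are badly unbalanced on the raw data ...
raw_counts = bucket_counts(skewed, 10, 0.0, max(skewed))
# ... but hold almost the same number of items after the transformation.
uniform_counts = bucket_counts([cdf(v) for v in skewed], 10, 0.0, 1.0)
```

After the transformation a simple equal-width grid is balanced, so only a compact description of the marginal distribution is needed in place of a tree over the data, which is consistent with the small memory footprint claimed above.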
In the literature, the exact kNN query over a large-scale dataset has been limited to low-dimensional space, because the query response time and memory space requirements of Tree-based Index methods increase with dimension. Unable to resolve this exponential dependency on the dimension, researchers assume that no efficient solution exists and propose approximate kNN in high-dimensional space. Unlike the approximate kNN query, which tries to retrieve approximate nearest neighbours from large-scale datasets, this thesis proposes a new type of kNN query referred to as the 'estimated kNN query'. The estimated kNN query processing methodology attempts to estimate the nearest neighbours based on the marginal cumulative distributions of the underlying data using statistical copulas. This thesis showcases the performance trade-off between exact kNN and estimated kNN queries in terms of estimation error and scalability. In contrast, kNN regression predicts the value of a target variable based on the kNN; however, particularly in a high-dimensional large-scale dataset, the query response time of kNN regression can be significantly high due to the curse of dimensionality. In an effort to tackle this issue, a new probabilistic kNN regression method is proposed. The proposed method statistically predicts the values of the target variable of the kNN without computing distances.
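As background for the role of copulas mentioned above, Sklar's theorem (listed among the keywords) links a joint distribution to its marginal cumulative distributions. The statement below is the standard form of the theorem, given here for reference only; the notation ($F$, $F_i$, $C$, $d$) is assumed and does not come from the thesis.

```latex
% Sklar's theorem: for a d-dimensional joint CDF F with marginal CDFs F_1, ..., F_d,
% there exists a copula C : [0,1]^d -> [0,1] such that
\[
  F(x_1, \dots, x_d) \;=\; C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr)
  \qquad \text{for all } (x_1, \dots, x_d) \in \mathbb{R}^d .
\]
% If the marginals F_1, ..., F_d are continuous, the copula C is unique.
```

In other words, the dependence structure of the data can be modelled separately from the per-dimension distributions.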
In a different context, kNN as a missing-value imputation algorithm in high-dimensional space is investigated in Pythia, a distributed/parallel missing value imputation framework. In Pythia, a different way of indexing a high-dimensional large-scale dataset is proposed by the group (not the work of the author of this thesis); by using such indexing methods, the scaling-out of kNN in high-dimensional space was ensured. Pythia uses Adaptive Resonance Theory (ART), a machine learning clustering algorithm, to build a data digest (also known as signatures) of large-scale datasets distributed across several data machines. The main idea is that, given an input vector, Pythia predicts the most relevant data centres to involve in processing, for example, kNN. Pythia does not retrieve the exact kNN. To this end, instead of accessing the entire dataset that resides in a data node, this thesis proposes accessing only the relevant clusters that reside in the appropriate data nodes. As shown later in the thesis, such a method has accuracy comparable to that of the original design of Pythia but a lower imputation time. Moreover, the imputation time does not grow significantly with the size of the dataset that resides in a data node or with the number of data nodes in Pythia. Furthermore, as Pythia depends entirely on the data digest built by ART to predict relevant data centres, this thesis also investigates the performance of Pythia by comparing signatures constructed by a different clustering algorithm, Self-Organising Maps.
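The routing idea described above, choosing data nodes and then clusters from a compact signature, can be sketched as follows. This is a simplified illustration and not Pythia's implementation: it assumes each node's signature is a list of cluster centroids that has already been computed (by ART, Self-Organising Maps or any other clustering method), and the names `route_query` and `closest_cluster` are hypothetical.

```python
import math

def route_query(signatures, query, top_nodes=1):
    """signatures: {node_name: [centroid, ...]}.

    Return the `top_nodes` node names whose closest centroid is nearest to `query`.
    """
    def node_score(node):
        return min(math.dist(c, query) for c in signatures[node])
    return sorted(signatures, key=node_score)[:top_nodes]

def closest_cluster(centroids, query):
    """Index of the centroid nearest to `query`; only that cluster would be accessed."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], query))

# Hypothetical signatures: a few centroids per data node.
signatures = {
    "node-a": [(0.0, 0.0), (1.0, 1.0)],
    "node-b": [(10.0, 10.0)],
    "node-c": [(5.0, 5.0), (6.0, 5.0)],
}
query = (5.4, 5.1)
chosen = route_query(signatures, query, top_nodes=2)
cluster = closest_cluster(signatures[chosen[0]], query)
```

The coordinator only touches the small signatures, and each chosen node then reads a single cluster instead of its whole local dataset, which is why the imputation time need not grow with the amount of data per node.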
In this thesis, the performance advantages of the proposed approaches are substantiated and quantified via extensive experimentation with multi-dimensional real and synthetic datasets of different sizes and contexts.
Item Type:  Thesis (PhD) 

Qualification Level:  Doctoral 
Keywords:  scaling k nearest neighbours, big data, indexing multidimensional data, large scale dataset, machine learning, HBase, Hadoop, Spark, Copulas, Pythia, Probabilistic Data Space Transformations, estimated kNN, approximate kNN regression, Adaptive Resonance Theory, Self Organising Maps, missing value imputation, Gaussian mixture model, Sklar's theorem, cumulative distribution function.
Subjects:  H Social Sciences > HA Statistics; Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools:  College of Science and Engineering > School of Computing Science 
Supervisor's Name:  Anagnostopoulos, Doctor Christos 
Date of Award:  2020 
Depositing User:  MR Atoshum Samuel Cahsai 
Unique ID:  glathesis:202081523 
Copyright:  Copyright of this thesis is held by the author. 
Date Deposited:  31 Jul 2020 08:12 
Last Modified:  31 Jul 2020 08:16 
URI:  http://theses.gla.ac.uk/id/eprint/81523 