Part of speech N-grams for information retrieval

Lioma, Christina Amalia (2008) Part of speech N-grams for information retrieval. PhD thesis, University of Glasgow.

The increasing availability of information on the World Wide Web (Web), and the need to access relevant aspects of this information, provide an important impetus for the development of automatic intelligent Information Retrieval (IR) technology. IR systems convert human-authored language into representations that can be processed by computers, with the aim of providing humans with access to knowledge. Specifically, IR applications locate and quantify informative content in data, and make statistical decisions on the topical similarity, or relevance, between different items of data. The wide popularity of IR applications in recent decades has driven intensive research and development into theoretical models of information and relevance, and their implementation in usable applications, such as commercial search engines.

The majority of IR systems today rely on statistical manipulations of individual lexical frequencies (i.e., single word counts) to estimate the relevance of a document to a user request, on the assumption that such lexical statistics can be sufficiently representative of informative content. Such estimations implicitly assume that words occur independently of each other, and as such ignore the compositional semantics of language. This assumption, however, is not entirely true, and can cause several problems, such as ambiguity in understanding textual information, misinterpreting or falsifying the original informative intent, and limiting the semantic scope of text. These problems can hinder the accurate estimation of relevance between texts, and hence harm the performance of an IR application.
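
As a small illustration of the independence assumption (a sketch that is not taken from the thesis; the two example sentences and the helper name `bag_of_words` are invented here), a representation based only on single word counts cannot tell apart two sentences that use the same words in a different order:

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count single-word frequencies, discarding word order (hypothetical helper)."""
    return Counter(text.lower().split())


# Two made-up sentences with opposite meanings but identical word counts.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# The two representations are indistinguishable: who bites whom is lost.
print(first == second)
```

Under this representation both sentences would receive the same relevance score for any query, which is the kind of loss of compositional meaning described above.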

This thesis investigates the use of non-lexical statistics by IR models, with the goal of enhancing the estimation of relevance between a document and a user request. These non-lexical statistics consist of part of speech information. The parts of speech are the grammatical classes of words (e.g., noun, verb). Part of speech statistics are modelled in the form of part of speech (POS) n-grams, which are contiguous sequences of parts of speech, extracted from text.
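
To make the definition of a POS n-gram concrete, here is a minimal sketch. It assumes the text has already been tagged; the tag labels (`DET`, `NOUN`, `VERB`), the example sentence, and the function name `pos_ngrams` are illustrative choices, not the tag set or tooling used in the thesis.

```python
from typing import List, Tuple


def pos_ngrams(tags: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous sequence of n part of speech tags."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tag sequence.
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hand-tagged example sentence (assumed tags; a real system would run a POS tagger).
tagged = [("the", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
          ("a", "DET"), ("mouse", "NOUN")]
tags = [tag for _, tag in tagged]

for ngram in pos_ngrams(tags, 3):
    print(ngram)
```

For this five-word sentence and n = 3 the window yields three POS trigrams; a sequence shorter than n yields none.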

The distribution of POS n-grams in language is statistically analysed. It is shown that there exists a relationship between the frequency and informative content of POS n-grams. Based on this, different applications of POS n-grams to IR technology are described and evaluated with state-of-the-art systems. Experimental results show that POS n-grams can assist the retrieval process.
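
A frequency analysis of this kind starts from counts of POS n-grams over a collection. The sketch below shows only that counting step on a toy, hand-made set of tag sequences; the data, the tag labels, and the function name `pos_ngram_counts` are assumptions for illustration, and the abstract does not specify how the thesis relates these frequencies to informative content.

```python
from collections import Counter
from typing import Iterable, List


def pos_ngram_counts(tag_sequences: Iterable[List[str]], n: int) -> Counter:
    """Count how often each POS n-gram occurs across tagged sentences."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        # Every contiguous window of n tags is one POS n-gram occurrence.
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts


# Toy corpus of already-tagged sentences (invented data).
corpus_tags = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB", "DET", "NOUN"],
]

counts = pos_ngram_counts(corpus_tags, 2)
for ngram, freq in counts.most_common():
    print(ngram, freq)
```

Counts like these give the frequency side of the relationship; any measure of informative content would have to be defined and estimated separately.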

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: Information Retrieval, Computational Linguistics, Natural Language Processing
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Van Rijsbergen, Prof. C.J.
Date of Award: 2008
Depositing User: Christina Amalia Lioma
Unique ID: glathesis:2008-340
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 16 Jul 2008
Last Modified: 10 Dec 2012 13:17
