Identifying latent relationship information in documents for efficient and effective sensitivity review

Narvala, Hitarth (2024) Identifying latent relationship information in documents for efficient and effective sensitivity review. PhD thesis, University of Glasgow.

Full text available as:
PDF: 2024narvalaphd.pdf (6MB)

Abstract

Freedom of Information (FOI) laws exist in over a hundred countries to ensure public access to information that is held by government and public institutions. However, FOI laws exempt from public disclosure any sensitive information (e.g. personal or confidential information) whose release could violate the human rights of individuals or endanger a country’s national security. Hence, government documents must undergo a rigorous sensitivity review before they can be considered for public release. Sensitivity review is typically a manual process, since it requires the utmost accuracy to ensure that potentially sensitive information is protected from public release. However, due to the massive volume of government documents that must be sensitivity reviewed, it is impractical to conduct a fully manual sensitivity review. Moreover, identifying sensitive information is itself a complex task, which often requires analysing hidden patterns or connections, i.e., latent relations between documents, such as mentions of specific individuals or descriptions of events, activities or discussions that could span multiple documents.

In this thesis, we argue that automatically identifying latent relations between documents can help the human users involved in the sensitivity review process to efficiently make accurate sensitivity judgements. In particular, we identify two user roles in the sensitivity review process, namely Review Organisers and Sensitivity Reviewers. Review Organisers prioritise and allocate documents for review to maximise openness, i.e., the number of documents selected for public release in a fixed time. Sensitivity Reviewers read the documents to determine whether they contain sensitive information. This thesis aims to address the following challenges in the respective tasks of the Review Organisers and Sensitivity Reviewers: (1) effectively prioritising documents for review to increase openness, (2) effectively allocating documents to reviewers based on their specific interests in different types of documents and content, and (3) accurately and efficiently identifying sensitive information by analysing latent relations between documents.

In this thesis, we propose novel methods for automatically identifying the latent relations between documents to assist both Review Organisers and Sensitivity Reviewers. We first propose RelDiff, a method for representing knowledge graph entities and relations in a single embedding space, which can improve the effectiveness of automatic sensitivity classification. Through empirical evaluation, we show that representing entire entity-relation-entity triples (e.g. personIsDirectorOf-company) can effectively indicate whether a piece of information (e.g. a person’s salary) should be considered sensitive or non-sensitive. We then propose to leverage document clustering to identify semantic categories that describe a high-level subject domain (e.g. criminality or politics). Through an extensive user study, we show that presenting documents in semantic categories can help the reviewers understand the type of content in a collection, thereby improving the reviewing speed of reviewers without affecting the accuracy of sensitivity review. Moreover, we show that prioritising semantic categories using sensitivity classification can help the Review Organisers release more documents in a fixed time (i.e. increase openness). Furthermore, we introduce the task of information threading, i.e., identifying coherent and chronologically evolving information about an event, activity or discussion from multiple documents. We propose novel information threading methods (i.e., SeqINT and HINT) and demonstrate their effectiveness through empirical evaluations compared to existing related methods. In addition, through a detailed user study, we show that reviewing documents in information threads can help the reviewers provide sensitivity judgements more quickly and accurately compared to a traditional document-by-document review.

Lastly, we propose to learn the reviewers’ interests in specific types of documents to effectively allocate documents based on the reviewers’ interests and expertise. We propose CluRec, a method for cluster-based recommendation that can effectively identify and recommend clusters of documents that are related based on the users’ interests. Through another comprehensive user study, we show that recommending documents to reviewers based on their interests can improve both the reviewers’ reviewing speed and their review accuracy.

Overall, we present a novel framework for sensitivity review, SERVE, that harnesses our proposed methods of identifying latent relations and provides a series of functionalities to the Sensitivity Reviewers and Review Organisers, namely: (1) Sequentially reviewing documents that are organised into semantic categories, to enable the quick and consistent review of similar documents. (2) Collectively reviewing related documents in coherent threads, to enable accurate and efficient review of sensitivities that are spread across multiple documents. (3) Customised prioritisation of documents for review based on the documents’ semantic categories and predicted sensitivity probabilities to enhance openness. (4) Recommending documents to reviewers based on their interests to effectively allocate documents to reviewers who are best equipped to understand and identify sensitive information in specific types of documents and content in a collection.

This is the first thesis that takes a system-oriented approach and investigates different novel functionalities to assist human sensitivity review. Our primary contributions in this thesis are our proposed framework for sensitivity review, SERVE, and its underlying methods to identify latent relations between documents that are potential indicators of sensitive information. Our extensive experiments and evaluations, involving thorough offline experiments and carefully designed user studies, demonstrate the real-world applicability of SERVE in enhancing the ability of government organisations to fulfil their openness obligations while protecting sensitive information to comply with FOI laws. In addition, we demonstrate the applications of our proposed novel methods for information threading and cluster-based recommendation beyond sensitivity review, i.e., in the news domain, which emphasises the generalisability of our contributions.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Ounis, Professor Iadh and McDonald, Dr Graham
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84317
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 14 May 2024 09:40
Last Modified: 14 May 2024 09:43
Thesis DOI: 10.5525/gla.thesis.84317
URI: https://theses.gla.ac.uk/id/eprint/84317