Wang, Gan (2025) OntolinkX: a context-aware linking approach integrating SapBERT and a cross-encoder reranker with hard negative training scenario for enhancing biomedical entity linking task. MSc(R) thesis, University of Glasgow.
Abstract
Biomedical Entity Linking (BEL), a crucial task in natural language processing, maps mentions of biomedical entities in free text to their corresponding concepts in standardized, structured biomedical ontologies such as the Unified Medical Language System (UMLS). The increasing volume of biomedical literature and the complexity of medical terminology present significant challenges for BEL, including entity ambiguity, dynamic knowledge bases, evolving terminology, and the need to maintain accuracy across diverse biomedical texts. Existing BEL systems often struggle with disambiguation, especially when context is minimal or ontology descriptions are sparse, which reduces how well their retrieval performance generalizes.
To address these challenges, we propose OntolinkX, a context-aware linking approach that integrates SapBERT with a cross-encoder reranker trained under a hard negative sampling scenario. It builds on SapBERT, a state-of-the-art entity linking approach that focuses mainly on synonym disambiguation and semantic alignment via contrastive learning but does not take full context into account. We show that adding a cross-encoder improves on SapBERT's performance in entity linking tasks. We explore the impact of incorporating additional information into the representations of both mention text and ontology concepts, two essential components of entity linking. We start by using entity names alone to represent ontology entries, then progressively augment the representations with semantic types and definitions. On the mention side, we incorporate contextual information from surrounding tokens within a dynamic window size. Furthermore, we examine the combined effect of fully contextualized mention representations and enriched ontology representations.
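As a rough illustration of the two kinds of input text described above, the following Python sketch builds a mention string with a token window and a concept string that grows from name only to name plus semantic type and definition. The function names, the `[M]`/`[/M]` marker tokens, the ` | ` separator, and the fixed `window` argument are all assumptions made for this example; the abstract does not specify how the thesis formats its inputs or how the window size is chosen dynamically.

```python
def mention_with_context(tokens, start, end, window):
    """Return the mention tokens[start:end] with up to `window` tokens on each side.

    The [M] ... [/M] markers are a hypothetical way to flag the mention span.
    """
    left = tokens[max(0, start - window):start]
    mention = tokens[start:end]
    right = tokens[end:end + window]
    return " ".join(left + ["[M]"] + mention + ["[/M]"] + right)


def concept_text(name, semantic_type=None, definition=None):
    """Build a concept string, adding optional fields only when they are given."""
    parts = [name]
    if semantic_type:
        parts.append(semantic_type)
    if definition:
        parts.append(definition)
    return " | ".join(parts)


tokens = "the patient was given cold medicine yesterday".split()
print(mention_with_context(tokens, 4, 5, 2))
print(concept_text("Common Cold"))
print(concept_text("Common Cold", "Disease or Syndrome", "A viral infection"))
```

Passing `window=0` reduces the mention string to the marked mention alone, which corresponds to the no-context end of the comparison.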
Our two-stage pipeline begins with SapBERT retrieving candidate entities for each mention. In the second stage, a cross-encoder is trained with a negative sampling approach, starting from randomly generated negative samples and progressing to challenging "hard negatives", the closest incorrect candidates returned by the retriever. Experiments show that incorporating richer information from both mention context and ontology descriptions improves retrieval performance. These findings suggest that the OntolinkX linking approach, together with enriched representations and a hard negative sampling strategy, can substantially improve BEL in complex biomedical texts.
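The difference between random and hard negatives can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: toy two-dimensional vectors and a dot product stand in for SapBERT embeddings and its similarity search, and the helper names (`retrieve`, `random_negatives`, `hard_negatives`) and concept ids are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(mention_vec, concept_vecs, k):
    """Stage 1 stand-in: return the k concept ids most similar to the mention."""
    ranked = sorted(
        concept_vecs,
        key=lambda cid: dot(mention_vec, concept_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


def random_negatives(gold_id, concept_ids, n, rng):
    """Sample n incorrect concepts uniformly at random."""
    pool = [cid for cid in concept_ids if cid != gold_id]
    return rng.sample(pool, min(n, len(pool)))


def hard_negatives(gold_id, candidates, n):
    """Take the n highest-ranked retrieved candidates that are not the gold concept."""
    return [cid for cid in candidates if cid != gold_id][:n]


# Toy data: C2 is close to the gold concept C1, so it is a hard negative.
concept_vecs = {
    "C1": [1.0, 0.0],
    "C2": [0.9, 0.1],
    "C3": [0.0, 1.0],
    "C4": [-1.0, 0.0],
}
mention_vec = [1.0, 0.2]

candidates = retrieve(mention_vec, concept_vecs, k=3)
easy = random_negatives("C1", list(concept_vecs), 2, random.Random(0))
hard = hard_negatives("C1", candidates, 2)

# Labelled (concept id, label) pairs of the kind a reranker could be trained on.
pairs = [("C1", 1)] + [(cid, 0) for cid in hard]
print(candidates, easy, hard, pairs)
```

A curriculum in the spirit of the description above would train first on pairs built from `easy` and later on pairs built from `hard`; the schedule for switching between them is not given in the abstract.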
Item Type: | Thesis (MSc(R)) |
---|---|
Qualification Level: | Masters |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Colleges/Schools: | College of Science and Engineering > School of Computing Science |
Supervisor's Name: | Lever, Dr. Jake |
Date of Award: | 2025 |
Depositing User: | Theses Team |
Unique ID: | glathesis:2025-85080 |
Copyright: | Copyright of this thesis is held by the author. |
Date Deposited: | 23 Apr 2025 08:04 |
Last Modified: | 23 Apr 2025 08:09 |
Thesis DOI: | 10.5525/gla.thesis.85080 |
URI: | https://theses.gla.ac.uk/id/eprint/85080 |