OntolinkX: a context-aware linking approach integrating SapBERT and a cross-encoder reranker with hard negative training scenario for enhancing biomedical entity linking task

Wang, Gan (2025) OntolinkX: a context-aware linking approach integrating SapBERT and a cross-encoder reranker with hard negative training scenario for enhancing biomedical entity linking task. MSc(R) thesis, University of Glasgow.

Full text available as: PDF, 2024WangMSc(R).pdf (565 kB)

Abstract

Biomedical Entity Linking (BEL), a crucial task in natural language processing, involves mapping mentions of biomedical entities in free text to their corresponding concepts in standardized, structured biomedical ontologies such as the Unified Medical Language System (UMLS). The increasing volume of biomedical literature and the complexity of medical terminologies present significant challenges for BEL, including entity ambiguity, dynamic knowledge bases, evolving terminology, and the need to maintain accuracy across diverse biomedical texts. Existing BEL systems often struggle with disambiguation, especially when mention context is minimal or ontology descriptions are sparse, leading to reduced generalization and weaker retrieval performance.

To address these challenges, we propose OntolinkX, a context-aware linking approach that integrates SapBERT with a cross-encoder reranker trained under a hard negative sampling scenario. It builds on SapBERT, a state-of-the-art entity linking approach that focuses on synonym disambiguation and semantic alignment via contrastive learning but does not take full mention context into account. We show that adding a cross-encoder improves on SapBERT's performance in entity linking tasks. We also explore the impact of incorporating additional information into the representation of both mention text and ontology concepts, the two essential components of entity linking. We start by taking entity names to represent ontology entries, then progressively augment these representations with semantic types and definitions. On the mention side, we incorporate contextual information from surrounding tokens within a dynamic window. Finally, we examine the combined effect of fully contextualized mention representations and enriched ontology representations.
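To make the enrichment concrete, the following is a minimal sketch (not the thesis code) of how the augmented inputs could be built and encoded with a public SapBERT checkpoint. The window size, the field layout joining name, semantic type and definition, and the use of [CLS] pooling are illustrative assumptions rather than the thesis's exact choices.

```python
# Sketch: enriched mention and ontology representations (illustrative assumptions).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # public SapBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def mention_with_context(tokens, start, end, window=16):
    """Surround the mention span with up to `window` tokens on each side."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return " ".join(left + tokens[start:end] + right)

def ontology_text(name, semantic_type=None, definition=None):
    """Progressively enrich a concept entry: name -> + semantic type -> + definition."""
    parts = [name]
    if semantic_type:
        parts.append(f"[{semantic_type}]")
    if definition:
        parts.append(definition)
    return " ".join(parts)

@torch.no_grad()
def embed(texts):
    """Encode a list of strings with SapBERT, using [CLS] pooling."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0, :]
```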

Our two-stage pipeline begins with SapBERT retrieving potential entity candidates for each mention. In the second stage, a cross-encoder is trained with a negative sampling curriculum, starting from randomly generated negative samples and progressing to challenging "hard negatives", the closest incorrect candidates returned by the retriever. Experiments show that incorporating richer information from both mention context and ontology descriptions improves retrieval performance. These findings suggest that the OntolinkX linking approach, combined with enriched representations and a hard negative sampling strategy, can substantially improve BEL on complex biomedical texts.
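The sketch below outlines the two stages under the same hedged assumptions: dense retrieval over SapBERT embeddings, hard negatives mined from the top-ranked incorrect candidates, and a generic BERT-style pair classifier standing in for the cross-encoder. The backbone name, top-k value, and number of negatives are placeholders, not the thesis's settings; `embed` refers to the encoder in the previous sketch.

```python
# Sketch: two-stage retrieve-then-rerank pipeline with hard negative mining (illustrative).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def retrieve(mention_vec, concept_vecs, top_k=64):
    """Stage 1: rank candidate concepts by cosine similarity to the mention embedding."""
    sims = F.cosine_similarity(mention_vec.unsqueeze(0), concept_vecs)
    return torch.topk(sims, k=min(top_k, concept_vecs.size(0))).indices.tolist()

def mine_hard_negatives(candidate_ids, gold_id, n_neg=8):
    """The highest-ranked *incorrect* candidates become hard negatives for training."""
    return [cid for cid in candidate_ids if cid != gold_id][:n_neg]

CE = "bert-base-uncased"  # placeholder cross-encoder backbone
ce_tokenizer = AutoTokenizer.from_pretrained(CE)
cross_encoder = AutoModelForSequenceClassification.from_pretrained(CE, num_labels=1)

@torch.no_grad()
def rerank(mention_text, candidate_texts):
    """Stage 2: score each (mention, concept) pair jointly and re-order the candidates."""
    batch = ce_tokenizer([mention_text] * len(candidate_texts), candidate_texts,
                         padding=True, truncation=True, return_tensors="pt")
    scores = cross_encoder(**batch).logits.squeeze(-1)
    order = torch.argsort(scores, descending=True)
    return [candidate_texts[i] for i in order]
```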

Item Type: Thesis (MSc(R))
Qualification Level: Masters
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Lever, Dr. Jake
Date of Award: 2025
Depositing User: Theses Team
Unique ID: glathesis:2025-85080
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 23 Apr 2025 08:04
Last Modified: 23 Apr 2025 08:09
Thesis DOI: 10.5525/gla.thesis.85080
URI: https://theses.gla.ac.uk/id/eprint/85080
