Effective and efficient transformer models for sequential recommendation

Petrov, Aleksandr V. (2025) Effective and efficient transformer models for sequential recommendation. PhD thesis, University of Glasgow.

Full text available as:
PDF: 2025petrovphd.pdf (3MB)

Abstract

In the last decade, advances in natural language processing have driven significant interest in Deep Learning-based Sequential Recommendation Systems, as user-item interaction sequences resemble word sequences in language models. In particular, the arrival of the Transformer architecture transformed the field of sequential recommendation, allowing Transformer-based models such as BERT4Rec and SASRec to achieve state-of-the-art results on many sequential recommendation problems. However, while these Transformer-based models perform well on small-scale academic datasets, they face challenges in real-life applications due to scalability problems and the complexity of modern recommendation goals, which include beyond-accuracy goals such as recommendation diversity. In this thesis, we closely examine the sources of these challenges and propose solutions that enable Transformer-based models for large-scale, real-world deployments.

In particular, training sequential recommenders is problematic. Indeed, most recommendation datasets contain different sets of items, which makes pre-training foundation models impossible and requires training recommendation models from scratch for every new recommendation dataset. Long training is problematic because it increases running costs and delays the processing of fresh data. In our reproducibility study, we find that, due to the long training requirement, practitioners often end up with underfit models. To tackle the long training problem, we propose Recency Sampling of Sequences (RSS), a novel training objective for sequential recommender systems that achieves strong results even when training time is limited. For example, on the MovieLens-20M dataset, RSS applied to the SASRec model yields a 60% improvement in NDCG over a vanilla SASRec and a 16% improvement over a fully trained BERT4Rec model, despite taking 93% less training time than BERT4Rec.
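
To illustrate the idea, the following is a minimal sketch of recency-biased target sampling, assuming an exponential recency weighting; the function names and the exact sampling distribution are illustrative assumptions rather than the precise RSS formulation:

    import random

    def recency_sample_targets(sequence, num_targets=1, alpha=0.8):
        # Sample target positions, weighting position i by alpha^(n-1-i),
        # so the most recent interactions are chosen as targets more often.
        n = len(sequence)
        weights = [alpha ** (n - 1 - i) for i in range(n)]
        positions = random.choices(range(n), weights=weights, k=num_targets)
        # Each training example: the interactions before the sampled position
        # form the model input, and the sampled item is the target.
        return [(sequence[:p], sequence[p]) for p in positions]

    # Usage: build training examples from one user's ordered item IDs.
    examples = recency_sample_targets([10, 42, 7, 99, 3], num_targets=2)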

Another big challenge for Transformer-based Sequential Recommender Systems is the large catalogue of items, which may be several orders of magnitude larger than the vocabularies used by language models. Large catalogues create the need for negative sampling during training, but in this thesis, we show that negative sampling causes effectiveness degradation. To mitigate this problem, we design a new gBCE loss, which counters the effects of negative sampling by down-weighting the contribution of the positive sample in the overall cost. We show that gBCE achieves state-of-the-art effectiveness with large catalogues, even while retaining negative sampling.
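
As a concrete illustration, here is a minimal sketch of a sampled binary cross-entropy loss in which the positive term is down-weighted by a calibration parameter beta; this parameterisation is an assumption for illustration, not the thesis' exact gBCE formulation:

    import numpy as np

    def log_sigmoid(x):
        # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
        return -np.logaddexp(0.0, -x)

    def sampled_bce_downweighted_positive(pos_logit, neg_logits, beta=0.75):
        # The positive term is scaled by beta (equivalent to raising sigmoid(s+)
        # to the power beta), countering the overconfidence induced by scoring
        # only a small sample of negatives instead of the whole catalogue.
        positive_term = beta * log_sigmoid(pos_logit)
        # Negative term: log(1 - sigmoid(s-)) = log(sigmoid(-s-)) for each negative.
        negative_term = np.sum(log_sigmoid(-np.asarray(neg_logits)))
        return -(positive_term + negative_term)

    # Usage: one positive item score and a handful of sampled negative scores.
    loss = sampled_bce_downweighted_positive(2.1, [0.3, -1.2, 0.5])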

A large catalogue also makes the item embedding tensor large and model inference slow, as the sequence embedding must be multiplied by this large embedding tensor. On the large-scale Gowalla dataset, where training non-sampled models is infeasible due to the large catalogue size, we obtain substantial improvements by enhancing SASRec with the gBCE loss (+47%). We also reduce the memory footprint and speed up model inference using our proposed RecJPQ technique, which splits atomic item IDs into compact compositional sub-item ID representations.
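
The compositional idea can be sketched as follows: each item ID maps to a few sub-item IDs, and the item embedding is assembled from much smaller sub-item embedding tables. The sizes, array names, and concatenation-based assembly below are assumptions for illustration:

    import numpy as np

    # Hypothetical sizes: a large catalogue, a few splits, few sub-IDs per split.
    num_items, num_splits, ids_per_split, dim_per_split = 1_000_000, 8, 256, 16

    # Codebook: each item is assigned one sub-item ID per split.
    item_codes = np.random.randint(0, ids_per_split, size=(num_items, num_splits))
    # Sub-item embeddings: num_splits small tables instead of one huge item table.
    sub_embeddings = np.random.randn(num_splits, ids_per_split, dim_per_split)

    def item_embedding(item_id):
        # Assemble the item embedding by concatenating its sub-item embeddings.
        parts = [sub_embeddings[s, item_codes[item_id, s]] for s in range(num_splits)]
        return np.concatenate(parts)  # shape: (num_splits * dim_per_split,)

    vec = item_embedding(42)

Under these illustrative sizes, a full item embedding table would store num_items × 128 floats, whereas the sub-item tables store only num_splits × ids_per_split × dim_per_split floats plus the integer codebook, which is where the memory saving comes from.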

Building upon RecJPQ’s sub-item representations, we also address the problem of slow model inference with large catalogues. In particular, we propose two algorithms for fast item scoring. First, we propose the PQTopK algorithm, which computes item scores as the sum of sub-item scores. Sub-item scores can be pre-computed and re-used between items, which results in up to 4.5× faster item scoring compared to the regular Transformer scoring. We further observe similarities between RecJPQ’s sub-item representations and bag-of-words representations in Information Retrieval (IR). In IR, the problem of fast scoring over large collections of documents has been addressed using Dynamic Pruning approaches, which find the top-K documents without scoring the whole collection exhaustively. Building upon the similarities between item representations in RecJPQ and document representations in IR, we propose the RecJPQPrune dynamic pruning algorithm for RecJPQ-based recommenders. RecJPQPrune further improves scoring time by up to 5.3× compared to PQTopK and up to 64× compared to the regular Transformer scoring.
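
A minimal sketch of the PQTopK idea, reusing the sub-item structures from the previous sketch: sub-item scores are computed once per sequence embedding, and each item's score is then a sum of table lookups rather than a full dot product. The names and shapes are illustrative assumptions:

    import numpy as np

    def pqtopk_scores(seq_embedding, sub_embeddings, item_codes, k=10):
        num_splits, ids_per_split, dim = sub_embeddings.shape
        # Split the sequence embedding to match the per-split sub-item embeddings.
        seq_parts = seq_embedding.reshape(num_splits, dim)
        # Pre-compute all sub-item scores once: a (num_splits, ids_per_split) table.
        sub_scores = np.einsum('sd,sid->si', seq_parts, sub_embeddings)
        # Each item's score is the sum of its sub-item scores (lookups, no dot products).
        item_scores = sub_scores[np.arange(num_splits), item_codes].sum(axis=1)
        top_k = np.argsort(-item_scores)[:k]
        return top_k, item_scores[top_k]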

Finally, while existing Transformer-based models perform well when measured using accuracy-based ranking metrics (e.g. NDCG), they usually struggle to optimise more complex goals, such as increasing diversity or reducing popularity bias. To improve model effectiveness on these complex beyond-accuracy goals, we propose an autoregressive Next-K recommendation strategy as an alternative to the traditional "score-and-rank" approach. We also propose a universal reinforcement learning-based alignment scheme for the Next-K strategy and show that it is possible to align a generative recommendation model with beyond-accuracy goals, such as diversity promotion. Our experiments on two datasets show that in 3 out of 4 cases, GPTRec’s Next-K generation approach offers a better tradeoff between accuracy and secondary metrics than classic greedy re-ranking techniques for diversity optimisation and decreasing popularity bias.
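
A simplified sketch contrasting Next-K generation with score-and-rank: the model is queried repeatedly, each time conditioning on the items already placed in the slate, so later picks can account for properties such as the diversity of earlier picks. The model interface (score_next) is a hypothetical placeholder, not the GPTRec API:

    def next_k_generate(model, history, num_items, k=10):
        # Build the slate one item at a time; each step conditions on the
        # items already recommended, unlike score-and-rank, which scores
        # all items once and sorts them.
        slate = []
        for _ in range(k):
            # Hypothetical interface: scores over all items given history + partial slate.
            scores = model.score_next(history + slate)
            candidates = (i for i in range(num_items) if i not in slate)
            slate.append(max(candidates, key=lambda i: scores[i]))
        return slate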

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: MacDonald, Dr. Craig and Ounis, Professor Iadh
Date of Award: 2025
Depositing User: Theses Team
Unique ID: glathesis:2025-85270
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 30 Jun 2025 13:47
Last Modified: 30 Jun 2025 13:54
Thesis DOI: 10.5525/gla.thesis.85270
URI: https://theses.gla.ac.uk/id/eprint/85270
