News vertical search using user-generated content

McCreadie, Richard (2012) News vertical search using user-generated content. PhD thesis, University of Glasgow.

Full text available as: PDF (2012mccreadiephd.pdf, 6MB)

Abstract

The thesis investigates how content produced by end-users on the World Wide Web — referred to
as user-generated content — can enhance the news vertical aspect of a universal Web search engine,
such that news-related queries can be satisfied more accurately, comprehensively and in a more timely
manner. We propose a news search framework to describe the news vertical aspect of a universal Web
search engine. This framework comprises four components, each providing a different piece of
functionality. The Top Events Identification component identifies the most important events that are
happening at any given moment using discussion in user-generated content streams. The News Query
Classification component classifies incoming queries as news-related or not in real-time. The Ranking
News-Related Content component finds and ranks relevant content for news-related user queries from
multiple streams of news and user-generated content. Finally, the News-Related Content Integration
component merges the previously ranked content for the user query into the Web search ranking. In this
thesis, we argue that user-generated content can be leveraged in one or more of these components to
better satisfy news-related user queries. Potential enhancements include the faster identification of news
queries relating to breaking news events, more accurate classification of news-related queries, increased
coverage of the events searched for by the user or increased freshness in the results returned.
Approaches to tackle each of the four components of the news search framework are proposed,
which aim to leverage user-generated content. Together, these approaches form the news vertical component
of a universal Web search engine. Each approach proposed for a component is thoroughly
evaluated using one or more datasets developed for that component. Conclusions are derived concerning
whether the use of user-generated content enhances the component in question using an appropriate
measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the
Top Events Identification component; classification accuracy over different types of query for the News
Query Classification component; relevance of the documents returned for the Ranking News-Related
Content component; and end-user preference for rankings integrating user-generated content in comparison
to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches
and insights into their behaviour are also discussed.
In particular, the evaluation of the Top Events Identification component examines how effectively
events — represented by newswire articles — can be ranked by their importance using two different
streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed
approaches for this component indicates that blog posts are an effective source of evidence to use when
ranking events and that these approaches achieve state-of-the-art effectiveness. When the same approaches
are instead driven by a stream of tweets, they provide a story ranking performance that is significantly
more effective than random, but this performance is not consistent across all of the datasets and approaches tested. Insights
are provided into the reasons for this with regard to the transient nature of discussion in Twitter.
Through the evaluation of the News Query Classification component, we show that the use of timely
features extracted from different news and user-generated content sources can increase the accuracy
of news query classification over relying upon newswire provider streams alone. Evidence also suggests
that the usefulness of the user-generated content sources varies as news events mature, with some
sources becoming more influential over time as new content is published, leading to an upward trend in
classification accuracy.
The Ranking News-Related Content component evaluation investigates how to effectively rank content
from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show
that learning-to-rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking
effectiveness under real-time constraints.
Finally, this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated
content for news-related queries to rankings containing only Web search results or integrated
with only newswire articles. Of the user-generated content sources tested, the most popular source is
shown to be Twitter, particularly for queries relating to breaking events.
The central contributions of this thesis are the introduction of a news search framework, the approaches
to tackle each of the four components of the framework that integrate user-generated content
and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad
range of experiments spanning the entire search process for news-related queries. The experiments reported
in this thesis demonstrate the potential and scope for enhancements that can be brought about by
leveraging user-generated content for real-time news search and related applications.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: News Vertical Search, Real-time Search, Web Search, Social Media, User-generated content, Event Identification, Query Classification, Federated Search, Crowdsourcing
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Ounis, Dr. Iadh
Date of Award: 2012
Depositing User: Mrs Marie Cairney
Unique ID: glathesis:2012-3813
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 21 Dec 2012 09:28
Last Modified: 30 Jan 2024 12:43
URI: https://theses.gla.ac.uk/id/eprint/3813
