Topic difficulty and order of document presentation in relevance assessments

Damessie, T 2018, Topic difficulty and order of document presentation in relevance assessments, Doctor of Philosophy (PhD), Science, RMIT University.


Document type: Thesis
Collection: Theses

Attached Files
Damessie.pdf: Thesis (application/pdf, 5.59 MB)
Title Topic difficulty and order of document presentation in relevance assessments
Author(s) Damessie, T
Year 2018
Abstract A test collection is crucial for evaluating the relative effectiveness of an Information Retrieval (IR) system. Relevance assessments are central to this approach to IR evaluation, and the accuracy of the assessors who judge relevance therefore has direct implications for the generality of conclusions drawn from test-collection-based evaluation metrics. An assessor's accuracy is commonly determined by comparing their new judgements against existing relevance assessments of the highest quality, known as gold standard judgements. Gold standard judgements are developed by subject experts who judge documents for their own information needs, expressed in the form of topics. However, gold standard assessments are rarely available and are the most expensive to develop. As a result, gathering reliable relevance judgements from assessors who did not originate the topics has become an increasingly important problem in relevance evaluation.
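
Agreement with gold standard judgements is typically quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is illustrative only, not code from the thesis; the judgement lists are invented and relevance is assumed to be binary.

    from collections import Counter

    def cohens_kappa(judgements_a, judgements_b):
        # Chance-corrected agreement between two equal-length label lists.
        n = len(judgements_a)
        # Observed agreement: fraction of documents labelled identically.
        p_o = sum(a == b for a, b in zip(judgements_a, judgements_b)) / n
        # Expected chance agreement, from each assessor's label frequencies.
        freq_a, freq_b = Counter(judgements_a), Counter(judgements_b)
        p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
        return (p_o - p_e) / (1 - p_e)

    gold = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical gold standard (1 = relevant)
    new = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical new assessor judgements
    print(cohens_kappa(gold, new))   # 0.5: moderate agreement with the gold standard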

On the basis of findings from previous studies, including our own earlier work, we assert that the order of document presentation and the difficulty of a topic influence the accuracy of a judge, quantified using agreement. In this work, we explore document presentation order and topic difficulty during a relevance assessment exercise, with the aim of improving agreement and consequently the reliability of relevance judgements. The study makes three main contributions.

First, TREC and NTCIR are two of the most widely known evaluation campaigns in IR. Leveraging the document presentation orders commonly used in these campaigns, we designed an experiment that ordered documents by their TREC-assigned identifiers and by their expected relevance. We found agreement to be higher among assessors who judged documents ordered by TREC identifier than among those who judged the same documents in relevance order.
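
As a minimal illustration of the two orderings compared (the document identifiers and relevance scores below are hypothetical):

    # Hypothetical (TREC document identifier, expected relevance score) pairs.
    docs = [("FT911-3", 0.42), ("FT911-1", 0.91), ("FT911-2", 0.13)]

    # Identifier order: sort by the TREC-assigned document ID.
    by_id = sorted(docs, key=lambda d: d[0])

    # Relevance order: most-likely-relevant documents first.
    by_relevance = sorted(docs, key=lambda d: d[1], reverse=True)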

Second, we extend this finding to explore dwell time, the time an assessor takes to judge a document, as an implicit factor influencing relevance judgement. We propose two measures of dwell time normalisation that account for differences in individual reading speed and differences in document length. We found that assessors identify non-relevant documents more quickly than relevant ones.
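
One plausible form of the two normalisations, sketched under assumed definitions (the thesis's exact formulas may differ): dividing dwell time by document length corrects for length, and dividing by an assessor's own mean dwell time corrects for individual reading speed.

    def length_normalised(dwell_seconds, doc_length_words):
        # Dwell time per word, removing the effect of document length
        # (assumed definition, not necessarily the thesis's).
        return dwell_seconds / doc_length_words

    def speed_normalised(dwell_seconds, assessor_mean_dwell_seconds):
        # Dwell time relative to the assessor's own average, removing the
        # effect of individual reading speed (assumed definition).
        return dwell_seconds / assessor_mean_dwell_seconds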

Third, motivated by two notions of topic difficulty, the difficulty of a topic for IR systems (system topic difficulty) and the difficulty of a topic for an assessor judging relevance (user topic difficulty), we set up an experiment to study the relationship between the two. We found only weak agreement between the two notions. Further analysis was then undertaken to select topics representative of both notions of difficulty. Using the selected topics, we designed an experiment to aid assessors in discriminating relevant from non-relevant documents, so that they converge more rapidly on a consistent notion of relevance during the assessment exercise. Several methods of document presentation ordering were investigated. We then propose a document presentation order that maximises agreement regardless of topic difficulty when compared with the orderings commonly used in relevance assessment campaigns.
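
The strength of the relationship between the two notions can be checked with a rank correlation; a minimal sketch, assuming one difficulty score per topic for each notion (all values invented):

    from scipy.stats import kendalltau

    # Hypothetical per-topic difficulty scores (higher = harder), aligned by topic.
    system_difficulty = [0.8, 0.2, 0.5, 0.9, 0.4]  # e.g. derived from system effectiveness
    user_difficulty = [0.3, 0.4, 0.6, 0.7, 0.2]    # e.g. derived from assessor disagreement

    tau, p_value = kendalltau(system_difficulty, user_difficulty)
    print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")  # a low tau indicates weak agreement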

We expect our proposed approaches to provide ways of understanding order effects and topic difficulty in relevance assessment, and that the findings can be applied to maximise agreement, gauge the quality of relevance assessments, and help gather reliable relevance judgements in situations where gold standard data are not available.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Science
Subjects Information Retrieval and Web Search
Keyword(s) Agreement
Order Effects
Topic Difficulty
Quality Assessments
Dwell Time
Assessor Reliability
IR Systems Evaluation