User variability and IR system evaluation

Bailey, P., Moffat, A., Scholer, F. and Thomas, P. 2015, 'User variability and IR system evaluation', in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), Santiago, Chile, 9-13 August 2015, pp. 625-634.


Document type: Conference Paper
Collection: Conference Papers

Title: User variability and IR system evaluation
Author(s): Bailey, P.; Moffat, A.; Scholer, F.; Thomas, P.
Year: 2015
Conference name: SIGIR 2015
Conference location: Santiago, Chile
Conference dates: 9-13 August 2015
Proceedings title: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015)
Publisher: Association for Computing Machinery
Place of publication: New York, United States
Start page: 625
End page: 634
Total pages: 10
Abstract: Test collection design eliminates sources of user variability to make statistical comparisons among information retrieval (IR) systems more affordable. Does this choice unnecessarily limit generalizability of the outcomes to real usage scenarios? We explore two aspects of user variability with regard to evaluating the relative performance of IR systems, assessing effectiveness in the context of a subset of topics from three TREC collections, with the embodied information needs categorized against three levels of increasing task complexity. First, we explore the impact of widely differing queries that searchers construct for the same information need description. By executing those queries, we demonstrate that query formulation is critical to query effectiveness. The results also show that the range of scores characterizing effectiveness for a single system arising from these queries is comparable to, or greater than, the range of scores arising from variation among systems using only a single query per topic. Second, our experiments reveal that searchers display substantial individual variation in the numbers of documents and queries they anticipate needing to issue, and that these numbers differ significantly in line with increasing task complexity levels. Our conclusion is that test collection design would be improved by the use of multiple query variations per topic, and could be further improved by the use of metrics that are sensitive to the expected numbers of useful documents.
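
A minimal, illustrative sketch of the abstract's central comparison (not taken from the paper; the system names, topics, and scores below are hypothetical): it contrasts the spread of effectiveness scores for a single system across query variations with the spread across systems when each topic contributes only one query.

    # Hypothetical data: scores[system][topic] -> one effectiveness score
    # per query variation for that topic.
    from statistics import mean

    scores = {
        "systemA": {"topic1": [0.62, 0.31, 0.48], "topic2": [0.55, 0.12, 0.40]},
        "systemB": {"topic1": [0.58, 0.44, 0.50], "topic2": [0.35, 0.28, 0.47]},
    }

    def score_range(values):
        """Spread (max - min) of a list of effectiveness scores."""
        return max(values) - min(values)

    # Within-system range: one system, scores varying only by query formulation.
    for system, by_topic in scores.items():
        within = mean(score_range(v) for v in by_topic.values())
        print(f"{system}: mean range across query variations = {within:.2f}")

    # Across-system range: each topic fixed to its first query variation.
    for topic in next(iter(scores.values())):
        across = score_range([by_topic[topic][0] for by_topic in scores.values()])
        print(f"{topic}: range across systems (single query) = {across:.2f}")

If the within-system ranges are comparable to or larger than the across-system ranges, single-query test collections risk attributing to systems differences that are really due to query formulation.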
Subjects: Information Retrieval and Web Search
Keyword(s): Relevance measures; Test collections; User behavior
ISBN: 9781450336215