Estimating measurement uncertainty for information retrieval effectiveness metrics

Moffat, A, Scholer, F and Yang, Z 2018, 'Estimating measurement uncertainty for information retrieval effectiveness metrics', Journal of Data and Information Quality, vol. 10, no. 3, pp. 1-22.


Document type: Journal Article
Collection: Journal Articles

Title Estimating measurement uncertainty for information retrieval effectiveness metrics
Author(s) Moffat, A
Scholer, F
Yang, Z
Year 2018
Journal name Journal of Data and Information Quality
Volume number 10
Issue number 3
Start page 1
End page 22
Total pages 22
Publisher Association for Computing Machinery
Abstract One typical way of building test collections for offline measurement of information retrieval systems is to pool the ranked outputs of different systems down to some chosen depth d and then form relevance judgments for those documents only. Non-pooled documents (those that did not appear in the top-d sets of any of the contributing systems) are then deemed to be non-relevant for the purposes of evaluating the relative behavior of the systems. In this article, we use RBP-derived residuals to re-examine the reliability of that process. By fitting the RBP parameter phi to maximize similarity between AP- and NDCG-induced system rankings, on the one hand, and RBP-induced rankings, on the other, an estimate can be made as to the potential score uncertainty associated with those two recall-based metrics. We then consider the effect that residual size, as an indicator of possible measurement uncertainty in utility-based metrics, has in connection with recall-based metrics, by computing the effect of increasing pool sizes and examining the trends that arise in terms of both metric score and system separability using standard statistical tests. The experimental results show that the confidence levels expressed via the p-values generated by statistical tests are only weakly connected to the size of the residual and to the degree of measurement uncertainty caused by the presence of unjudged documents. Statistical confidence estimates are, however, largely consistent as pooling depths are altered. We therefore recommend that all such experimental results should report, in addition to the outcomes of statistical significance tests, the residual measurements generated by a suitably matched weighted-precision metric, to give a clear indication of the measurement uncertainty that arises due to the presence of unjudged documents in test collections with finite pooled judgments.
Subject Information Retrieval and Web Search
Keyword(s) Effectiveness metric
Evaluation
Information retrieval
Statistical test
Test collection
DOI - identifier 10.1145/3239572
Copyright notice © 2018 Association for Computing Machinery.
ISSN 1936-1955
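
The residual mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of rank-biased precision (RBP), in which rank i carries weight (1 - phi) * phi^(i-1), and it treats the residual as the total weight of all unjudged positions plus the unseen tail of the ranking. The function name `rbp_with_residual` and the use of `None` to mark an unjudged document are illustrative choices, not taken from the article.

```python
from typing import Optional, Sequence, Tuple


def rbp_with_residual(
    judgments: Sequence[Optional[float]], phi: float
) -> Tuple[float, float]:
    """Return (base RBP score, residual) for one ranked list.

    judgments[i] is the relevance (0..1) of the document at rank i+1,
    or None if that document was never judged (hypothetical encoding).
    """
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")

    base = 0.0
    residual = 0.0
    weight = 1.0 - phi  # weight assigned to rank 1
    for rel in judgments:
        if rel is None:
            # An unjudged document might be fully relevant, so its whole
            # weight counts toward the score uncertainty.
            residual += weight
        else:
            base += weight * rel
        weight *= phi  # geometric decay to the next rank

    # Weight of every rank beyond the end of the list:
    # sum over i > n of (1 - phi) * phi^(i-1) = phi^n.
    residual += phi ** len(judgments)
    return base, residual


if __name__ == "__main__":
    base, residual = rbp_with_residual([1, 0, None, 1], phi=0.5)
    print(f"RBP lies in [{base}, {base + residual}]")
```

Under these assumptions the true score lies between the base value and the base value plus the residual, so a larger residual signals more measurement uncertainty from unjudged documents.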