A scalable system for identifying co-derivative documents

Bernstein, Y and Zobel, J 2004, 'A scalable system for identifying co-derivative documents', in A. Apostolico and M. Melucci (eds.) String Processing and Information Retrieval: 11th International Conference, SPIRE 2004, Padova, Italy, 7 December 2004, pp. 55-57.

Document type: Conference Paper
Collection: Conference Papers

Title A scalable system for identifying co-derivative documents
Author(s) Bernstein, Y
Zobel, J
Year 2004
Conference name International Conference on String Processing and Information Retrieval
Conference location Padova, Italy
Conference dates 7 December 2004
Proceedings title String Processing and Information Retrieval: 11th International Conference, SPIRE 2004
Editor(s) A. Apostolico
M. Melucci
Publisher Springer
Place of publication Berlin, Germany
Start page 55
End page 57
Total pages 3
Abstract Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype system that makes use of SPEX. Our experiments with several document collections demonstrate the effectiveness of the approach.
Subjects Business Information Management (incl. Records, Knowledge and Information Management, and Intelligence)
Keyword(s) co-derivatives
document collection
search engines
DOI - identifier 10.1007/b100941
Copyright notice © Springer-Verlag Berlin Heidelberg 2004
ISBN 978-3-540-23210-0
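
Illustrative note: the abstract describes document fingerprinting as matching documents on the hash values of selected chunks. The sketch below shows that general idea only; it is not SPEX or DECO, whose details this record does not give. The chunk definition (overlapping runs of three words), the hash function, and the names chunks, chunk_hash and shared_chunk_counts are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a short hex digest (the hash choice is arbitrary here)."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Map each document pair to the number of distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    counts = defaultdict(int)
    for doc_ids in postings.values():
        # A hash seen in only one document yields no pairs and is ignored.
        for pair in combinations(sorted(doc_ids), 2):
            counts[pair] += 1
    return dict(counts)


# Toy collection (invented for this example, not data from the paper).
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "completely unrelated text about string processing",
}
result = shared_chunk_counts(docs)
print(result)
```

In this toy collection only documents "a" and "b" share chunks, so they are the only candidate co-derivative pair. A naive approach like this stores every chunk hash, including the many that occur in a single document; the abstract presents SPEX as a way of extracting only the duplicated chunks.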
Created: Wed, 08 Apr 2009, 09:42:32 EST by Catalyst Administrator