RMIT University Research Repository
 

Accurate Discovery of Co-derivative Documents Via Duplicate Text Detection

Bernstein, Y and Zobel, J 2006, 'Accurate Discovery of Co-derivative Documents Via Duplicate Text Detection', Information Systems, vol. 31, pp. 595-609.

Document type: Journal Article
Collection: Journal Articles

Title Accurate Discovery of Co-derivative Documents Via Duplicate Text Detection
Author(s) Bernstein, Y
Zobel, J
Year 2006
Journal name Information Systems
Volume number 31
Start page 595
End page 609
Total pages 14
Publisher Pergamon
Abstract Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype package that combines the SPEX algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
Subject Information Systems not elsewhere classified
Copyright notice Copyright © 2005 Elsevier B.V. All rights reserved
ISSN 0306-4379
 
Citation counts: Cited 7 times in Scopus
Access Statistics: 81 Abstract Views
Created: Wed, 18 Feb 2009, 09:53:18 EST by Catalyst Administrator