Comparative analysis of sequence clustering methods for deduplication of biological databases

Chen, Q, Wan, Y, Zhang, J, Lei, Y, Zobel, J and Verspoor, K 2018, 'Comparative analysis of sequence clustering methods for deduplication of biological databases', Journal of Data and Information Quality, vol. 9, no. 3, pp. 1-27.


Document type: Journal Article
Collection: Journal Articles

Title: Comparative analysis of sequence clustering methods for deduplication of biological databases
Author(s): Chen, Q; Wan, Y; Zhang, J; Lei, Y; Zobel, J; Verspoor, K
Year: 2018
Journal name: Journal of Data and Information Quality
Volume number: 9
Issue number: 3
Start page: 1
End page: 27
Total pages: 27
Publisher: Association for Computing Machinery
Abstract: The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
Subject: Database Management
Keyword(s): Clustering; Databases; Deduplication; Validation
DOI - identifier: 10.1145/3131611
Copyright notice: © 2018 ACM
ISSN: 1936-1955
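
Illustrative note: the abstract refers to heuristic clustering driven by a simple similarity threshold, used in place of an all-by-all pairwise comparison. The following is a minimal sketch of that general idea, not the actual CD-HIT or UCLUST implementation. The function names `identity` and `greedy_cluster` are hypothetical, and `difflib.SequenceMatcher` is an assumed stand-in for a real alignment-based sequence identity measure.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumed stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering against cluster representatives.

    Each sequence is compared only with the current representatives, not with
    every other sequence, which is what avoids the all-by-all comparison.
    Returns a dict mapping representative index -> list of member indices.
    """
    # Longest sequences first, so each representative is the longest
    # member of its cluster (ties keep their input order).
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    clusters = {}
    for i in order:
        for rep in clusters:
            if identity(seqs[i], seqs[rep]) >= threshold:
                clusters[rep].append(i)  # join the first matching cluster
                break
        else:
            clusters[i] = [i]  # no representative is similar enough: new cluster
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
    print(greedy_cluster(example, threshold=0.8))
```

In this sketch, lowering `threshold` merges more sequences into fewer, larger clusters. Members are only guaranteed to be similar to their representative, not to one another, which is one way to picture why the abstract reports an acute efficiency and accuracy trade-off at low thresholds and large cluster sizes.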
Citation counts: Thomson Reuters Web of Science: cited 0 times
Scopus: cited 0 times
Created: Tue, 23 Oct 2018, 16:00:00 EST by Catalyst Administrator