Supervised learning for detection of duplicates in genomic sequence databases

Chen, Q, Zobel, J, Zhang, X and Verspoor, K 2016, 'Supervised learning for detection of duplicates in genomic sequence databases', PLoS ONE, vol. 11, no. 8, e0159644, pp. 1-20.

Document type: Journal Article
Collection: Journal Articles

Attached files: n2006069250.pdf (Published Version, application/pdf, 1.28MB)
Title: Supervised learning for detection of duplicates in genomic sequence databases
Author(s): Chen, Q; Zobel, J; Zhang, X; Verspoor, K
Year: 2016
Journal name: PLoS ONE
Volume number: 11
Issue number: 8
Article number: e0159644
Start page: 1
End page: 20
Total pages: 20
Publisher: Public Library of Science
Abstract:
Motivation: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.
Results: We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from metadata, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.
Subject: Pattern Recognition and Data Mining
DOI: 10.1371/journal.pone.0159644
Copyright notice: © 2016 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
ISSN: 1932-6203
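
Illustrative sketch (not from the article): the workflow summarised in the abstract (represent each pair of records by features, train a binary model, then ablate features to see which matter) might look roughly like the following. Everything here is an assumption made for illustration: the record fields `description` and `sequence`, the three toy features, the nearest-centroid classifier, and the leave-one-out evaluation are stand-ins, not the 22 features, models, or cross-validation protocol used by the authors.

```python
from difflib import SequenceMatcher

# Hypothetical feature names; the article's 22 features are not listed in this record.
FEATURE_NAMES = ["seq_identity", "length_ratio", "desc_similarity"]


def pair_features(rec_a, rec_b):
    """Return a small feature vector describing a pair of records."""
    seq_a, seq_b = rec_a["sequence"], rec_b["sequence"]
    # Crude stand-in for alignment-based sequence identity.
    seq_identity = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
    length_ratio = min(len(seq_a), len(seq_b)) / max(len(seq_a), len(seq_b), 1)
    # Metadata feature: similarity of the free-text descriptions.
    desc_similarity = SequenceMatcher(
        None, rec_a["description"].lower(), rec_b["description"].lower(),
        autojunk=False,
    ).ratio()
    return [seq_identity, length_ratio, desc_similarity]


def train(features, labels):
    """Nearest-centroid binary model: one mean feature vector per class."""
    model = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model, feature):
    """Return the label whose centroid is closest to `feature`."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(feature, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def loo_accuracy(features, labels, keep):
    """Leave-one-out accuracy using only the feature indices in `keep`."""
    rows = [[f[i] for i in keep] for f in features]
    correct = 0
    for i, row in enumerate(rows):
        model = train(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(model, row) == labels[i]
    return correct / len(rows)


def rec(description, sequence):
    return {"description": description, "sequence": sequence}


# Made-up labelled pairs (1 = duplicate, 0 = distinct), for demonstration only.
toy_pairs = [
    ((rec("Homo sapiens BRCA1 mRNA", "ACGTACGTAC"),
      rec("Homo sapiens BRCA1 mRNA, partial", "ACGTACGTAC")), 1),
    ((rec("Mus musculus Trp53 gene", "GGCCTTAAGG"),
      rec("Mus musculus Trp53 gene, complete", "GGCCTTAAGC")), 1),
    ((rec("Homo sapiens BRCA1 mRNA", "ACGTACGTAC"),
      rec("Zea mays chloroplast DNA", "TTTTGGGGCCCCAAAATTTT")), 0),
    ((rec("Mus musculus Trp53 gene", "GGCCTTAAGG"),
      rec("Danio rerio unknown clone", "ATATATATATATATAT")), 0),
]

features = [pair_features(a, b) for (a, b), _ in toy_pairs]
labels = [y for _, y in toy_pairs]
all_idx = list(range(len(FEATURE_NAMES)))

# Ablation: drop one feature at a time and compare with the full feature set.
print("all features:", loo_accuracy(features, labels, all_idx))
for i, name in enumerate(FEATURE_NAMES):
    keep = [j for j in all_idx if j != i]
    print("without", name + ":", loo_accuracy(features, labels, keep))
```

On real data one would replace the string-similarity stand-ins with proper sequence alignment scores and use a stronger classifier; the sketch is only meant to show the shape of a pairwise-feature pipeline with a simple ablation loop.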
Citation counts: cited 4 times in Thomson Reuters Web of Science; cited 3 times in Scopus
Access statistics: 95 abstract views, 31 file downloads
Created: Thu, 05 Jan 2017, 07:51:00 EST by Catalyst Administrator