The accurate prediction of disordered regions in protein sequences using machine learning approaches

Han, P 2011, The accurate prediction of disordered regions in protein sequences using machine learning approaches, Doctor of Philosophy (PhD), Computer Science and Information Technology, RMIT University.


Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Han.pdf Thesis application/pdf 4.14MB
Title The accurate prediction of disordered regions in protein sequences using machine learning approaches
Author(s) Han, P
Year 2011
Abstract A major challenge in the post-genome era is to determine the function of proteins. The traditional structure-function paradigm assumes that the function of a protein is contingent on its folding into a stable three-dimensional structure. However, many proteins contain intrinsically unstructured or Disordered Regions (DRs) under physiological conditions, and yet they still carry out important functions. Determining the disordered regions in proteins is therefore an important step towards determining their functions. Traditional experimental approaches are generally time-consuming and expensive. Efficient and cost-effective computer-aided automatic prediction of DRs is thus an attractive alternative. To this end, we propose the novel application of machine learning models and physicochemical features extracted from protein sequences for predicting long, short and global disorder in proteins.

To improve the interpretability of disorder prediction, rule-based predictors are proposed, which are not only able to predict DRs, but can also quantify previously unknown associations between order/disorder status and sequences. The prediction process is transparent and simple to explain.

As DRs of different lengths possess different properties, we propose separate predictors for long, short and global disorder in order to achieve high prediction accuracy. These predictors are distinct from each other in terms of their features, the machine learning models used, and the methods of prediction. We thoroughly investigate the database of amino acid physicochemical property indices and select the indices most correlated with disorder.
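As a rough illustration of selecting indices by their correlation with disorder, the sketch below ranks two amino acid indices by the absolute Pearson correlation between per-residue index values and binary disorder labels. It is a minimal sketch under stated assumptions, not the selection procedure used in the thesis: the abstract does not say which correlation measure was used, the `CHARGE` index is a simplified stand-in, and the sequence, labels, and function names are hypothetical.

```python
from math import sqrt

# Kyte-Doolittle hydropathy scale (a widely used published index).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified net-charge index: an illustrative assumption, not a curated index.
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_indices(sequence, labels, indices):
    """Rank indices by |r| between residue values and disorder labels (1 = disordered)."""
    scores = {}
    for name, table in indices.items():
        values = [table[aa] for aa in sequence]
        scores[name] = pearson(values, labels)
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy example: a polar, charged stretch labelled disordered,
# followed by a hydrophobic stretch labelled ordered.
sequence = "MKEEDSKRLLVIAFVLG"
labels = [1] * 8 + [0] * 9

ranking = rank_indices(sequence, labels, {"hydropathy": HYDROPATHY, "charge": CHARGE})
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

On this toy input hydropathy ranks first with a negative correlation, which matches the general expectation that disordered segments tend to be less hydrophobic. A real selection would run over many annotated sequences and a full index database.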

Based on these properties, novel feature transforms including autocorrelation and wavelet transforms (WTs) are applied to DR prediction. According to the results of cross-validation tests, our long DR predictor based on autocorrelation achieves the highest prediction accuracy among long DR predictors, with an AUC (Area Under the ROC Curve) value of 89.5%. A short DR predictor based on WTs achieves an AUC value of 88.7%, which is comparable to the most accurate short DR predictors. The global DR predictor achieves an AUC value of 96.1%, close to the optimal value of 100%.
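To make the two quantities in this paragraph concrete, the sketch below computes lagged autocorrelation features from a window of per-residue index values, and an AUC from predictor scores using the pairwise-ranking (Mann-Whitney) formulation. Both are illustrative assumptions: the abstract does not give the exact autocorrelation definition, window size, or lags used in the thesis, and the function names and toy inputs are hypothetical.

```python
def autocorrelation_features(values, max_lag):
    """Mean-centred autocorrelation of a numeric profile for lags 1..max_lag.

    For lag k the feature is the average of (v[i] - mean) * (v[i + k] - mean)
    over all valid positions i. This is one common definition, assumed here.
    """
    n = len(values)
    if max_lag < 1 or max_lag >= n:
        raise ValueError("max_lag must satisfy 1 <= max_lag < len(values)")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    scores higher than a randomly chosen negative (label 0); ties count 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy inputs.
profile = [1.0, -1.0, 1.0, -1.0]           # alternating index values
print(autocorrelation_features(profile, 2))

scores = [0.9, 0.2, 0.8, 0.1]              # predicted disorder scores
labels = [1, 1, 0, 0]                      # 1 = disordered, 0 = ordered
print(auc(scores, labels))
```

The pairwise AUC is quadratic in the number of residues, which is fine for illustration; a sort-based computation would be preferable on large benchmark sets. An AUC of 1.0 (100%) means every disordered residue is scored above every ordered one.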

A major bottleneck of large-scale DR prediction is poor time efficiency, which is attributable to slow feature generation stages and complicated prediction methods. Both our long and short DR predictors are built on simple prediction methods and simple feature spaces. Our web service for long DR prediction can process an uploaded file of multiple sequences.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Computer Science and Information Technology
Keyword(s) disordered regions
intrinsically unstructured
prediction
machine learning
protein sequence
decision tree
random forest
support vector machine
wavelet transform
Access Statistics: 324 Abstract Views, 410 File Downloads
Created: Thu, 01 Mar 2012, 16:07:41 EST by Guy Aron