Recovery-oriented software architecture for grid applications (ROSA-Grids)

Yusuf, I 2012, Recovery-oriented software architecture for grid applications (ROSA-Grids), Doctor of Philosophy (PhD), Computer Science and Information Technology, RMIT University.

Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Yusuf.pdf Thesis application/pdf;... 2.51MB
Title Recovery-oriented software architecture for grid applications (ROSA-Grids)
Author(s) Yusuf, I
Year 2012
Abstract Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects. Examples of grid resources include high-performance computers, massive data stores, high-bandwidth networking, telescopes, and synchrotrons. Failure in grids is arguably inevitable because of the massive scale and heterogeneity of grid resources, the distribution of these resources over unreliable networks, the complexity of the mechanisms needed to integrate such resources into a seamless utility, and the dynamic nature of the grid infrastructure, which allows continuous change. To make matters worse, grid applications are generally long running, and their runs repeatedly require coordinated use of many resources at the same time.

In this thesis, we propose the Recovery-Aware Components (RAC) approach. The RAC approach enables a grid application to handle failure reactively and proactively at the level of the smallest independent execution unit of the application. The approach also combines runtime prediction with a proactive fault tolerance strategy. The RAC approach aims to improve the reliability of the grid application with the least possible overhead. Moreover, to give a grid fault tolerance manager fine-tuned control over the tradeoff between reliability gained and overhead paid, this thesis offers architecture-aware modelling and simulation of reliability and overhead. For a few of the dozen or so classes of application architecture already identified in prior research, the thesis demonstrates that the typical architectural structure of the class can be captured in a few parameters. The work shows that these parameters suffice to achieve significant insight into, and control of, such tradeoffs.

The contributions of our research project are as follows. We defined the RAC approach. We showed the usage of the RAC approach for improving the reliability of MapReduce and Combinational Logic grid applications. We provided Markov models that represent the execution behaviour of these applications for reliability and overhead analyses. We analysed the sensitivity of the reliability-overhead tradeoff of the RAC approach to the type of fault tolerance strategy, the parameters of the fault tolerance strategy, the prediction interval, and the predictor’s accuracy. The final contribution of our research is an experimental testbed that enables a grid fault tolerance expert to evaluate diverse fault tolerance support configurations and then choose the one that satisfies the reliability and cost requirements.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Computer Science and Information Technology
Keyword(s) Grid
fault tolerance
Combinational Logic
Created: Fri, 12 Oct 2012, 10:21:56 EST by Brett Fenton