Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication
- Author
- Maria Chtepen (UGent), Filip HA Claeys, Bart Dhoedt (UGent), Filip De Turck (UGent), Peter A Vanrolleghem and Piet Demeester (UGent)
- Abstract
- Grids typically consist of autonomously managed subsystems with strongly varying resources, so fault-tolerance is an important aspect of application scheduling. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both reduce the amount of work lost when system availability changes, but both can introduce significant runtime overhead: for checkpointing the overhead depends largely on the length of the checkpointing interval, and for replication on the chosen number of replicas. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and reduce the overhead. It also discusses several novel heuristics that adaptively tune the checkpointing period online, based on historical information about resource behavior. A simulation-based comparison of the proposed combined algorithm with traditional strategies that use only checkpointing or only replication suggests a significant reduction in average task makespan for systems with varying load.
- Keywords
- task replication, fault-tolerance, adaptive checkpointing, grid computing
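The abstract names two mechanisms, switching between checkpointing and replication and tuning the checkpoint period from resource history, but this record does not give the paper's actual heuristics. The following is only a minimal illustrative sketch under stated assumptions, not the authors' algorithm: the interval rule is Young's classic approximation (square root of twice the checkpoint cost times the mean time between failures), and the switching rule, the class `ResourceHistory`, and all function and parameter names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceHistory:
    """Hypothetical record of observed uptimes (seconds between failures)."""
    uptimes: List[float] = field(default_factory=list)

    def mean_time_between_failures(self, default: float = 3600.0) -> float:
        # Fall back to an assumed default when no failures have been seen yet.
        if not self.uptimes:
            return default
        return sum(self.uptimes) / len(self.uptimes)


def checkpoint_interval(history: ResourceHistory,
                        checkpoint_cost: float,
                        min_interval: float = 60.0) -> float:
    """Young's approximation sqrt(2 * C * MTBF), clamped to a lower bound.

    This is a swapped-in textbook rule, not a heuristic from the paper.
    """
    mtbf = history.mean_time_between_failures()
    return max(min_interval, math.sqrt(2.0 * checkpoint_cost * mtbf))


def choose_strategy(idle_resources: int, replicas_wanted: int) -> str:
    """Assumed switching rule: replicate only when spare capacity exists."""
    if idle_resources >= replicas_wanted:
        return "replication"
    return "checkpointing"


if __name__ == "__main__":
    history = ResourceHistory(uptimes=[1800.0, 1800.0])
    print(choose_strategy(idle_resources=1, replicas_wanted=2))
    print(round(checkpoint_interval(history, checkpoint_cost=10.0), 1))
```

In this sketch a resource that fails more often yields a shorter interval, which is the general direction of the history-based tuning the abstract describes; the concrete formula and thresholds are assumptions.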
Citation
Please use this URL to cite or link to this publication: http://hdl.handle.net/1854/LU-370536
- MLA
- Chtepen, Maria, et al. “Providing Fault-Tolerance in Unreliable Grid Systems through Adaptive Checkpointing and Replication.” Lecture Notes in Computer Science, edited by Y Shi et al., vol. 4487, Springer, 2007, pp. 454–61, doi:10.1007/978-3-540-72584-8_60.
- APA
- Chtepen, M., Claeys, F. H., Dhoedt, B., De Turck, F., Vanrolleghem, P. A., & Demeester, P. (2007). Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication. In Y. Shi, G. van Albada, J. Dongarra, & P. Sloot (Eds.), Lecture Notes in Computer Science (Vol. 4487, pp. 454–461). https://doi.org/10.1007/978-3-540-72584-8_60
- Chicago author-date
- Chtepen, Maria, Filip HA Claeys, Bart Dhoedt, Filip De Turck, Peter A Vanrolleghem, and Piet Demeester. 2007. “Providing Fault-Tolerance in Unreliable Grid Systems through Adaptive Checkpointing and Replication.” In Lecture Notes in Computer Science, edited by Y Shi, GD van Albada, J Dongarra, and PMA Sloot, 4487:454–61. Berlin, Germany: Springer. https://doi.org/10.1007/978-3-540-72584-8_60.
- Chicago author-date (all authors)
- Chtepen, Maria, Filip HA Claeys, Bart Dhoedt, Filip De Turck, Peter A Vanrolleghem, and Piet Demeester. 2007. “Providing Fault-Tolerance in Unreliable Grid Systems through Adaptive Checkpointing and Replication.” In Lecture Notes in Computer Science, edited by Y Shi, GD van Albada, J Dongarra, and PMA Sloot, 4487:454–461. Berlin, Germany: Springer. doi:10.1007/978-3-540-72584-8_60.
- Vancouver
- 1. Chtepen M, Claeys FH, Dhoedt B, De Turck F, Vanrolleghem PA, Demeester P. Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication. In: Shi Y, van Albada G, Dongarra J, Sloot P, editors. Lecture Notes in Computer Science. Berlin, Germany: Springer; 2007. p. 454–61.
- IEEE
- [1] M. Chtepen, F. H. Claeys, B. Dhoedt, F. De Turck, P. A. Vanrolleghem, and P. Demeester, “Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication,” in Lecture Notes in Computer Science, Beijing, PR China, 2007, vol. 4487, pp. 454–461.
@inproceedings{370536,
  abstract = {{As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce significant runtime overhead. The latter largely depends on the length of checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several novel heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. Simulation-based comparison of the proposed combined algorithm versus traditional strategies based on checkpointing and replication only, suggests significant reduction of average task makespan for systems with varying load.}},
  author = {{Chtepen, Maria and Claeys, Filip HA and Dhoedt, Bart and De Turck, Filip and Vanrolleghem, Peter A and Demeester, Piet}},
  booktitle = {{Lecture Notes in Computer Science}},
  editor = {{Shi, Y and van Albada, GD and Dongarra, J and Sloot, PMA}},
  isbn = {{9783540725831}},
  issn = {{0302-9743}},
  keywords = {{task replication,fault-tolerance,adaptive checkpointing,grid computing}},
  language = {{eng}},
  location = {{Beijing, PR China}},
  pages = {{454--461}},
  publisher = {{Springer}},
  title = {{Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication}},
  url = {{http://doi.org/10.1007/978-3-540-72584-8_60}},
  volume = {{4487}},
  year = {{2007}},
}