Adaptive task checkpointing and replication: toward efficient fault-tolerant grids

Author
Maria Chtepen, Filip HA Claeys, Bart Dhoedt, Filip De Turck, Piet Demeester, and Peter A Vanrolleghem
Abstract
A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the abovementioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment Dynamic Scheduling in Distributed environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.
Keywords
performance of systems, Distributed systems, fault tolerance, availability, PERFORMANCE, INTERVAL

Citation

Please use this url to cite or link to this publication:

MLA
Chtepen, Maria et al. “Adaptive Task Checkpointing and Replication: Toward Efficient Fault-tolerant Grids.” IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 20.2 (2009): 180–190. Print.
APA
Chtepen, M., Claeys, F. H., Dhoedt, B., De Turck, F., Demeester, P., & Vanrolleghem, P. A. (2009). Adaptive task checkpointing and replication: toward efficient fault-tolerant grids. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 20(2), 180–190.
Chicago author-date
Chtepen, Maria, Filip HA Claeys, Bart Dhoedt, Filip De Turck, Piet Demeester, and Peter A Vanrolleghem. 2009. “Adaptive Task Checkpointing and Replication: Toward Efficient Fault-tolerant Grids.” IEEE Transactions on Parallel and Distributed Systems 20 (2): 180–190.
Vancouver
1. Chtepen M, Claeys FH, Dhoedt B, De Turck F, Demeester P, Vanrolleghem PA. Adaptive task checkpointing and replication: toward efficient fault-tolerant grids. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS. 2009;20(2):180–90.
IEEE
[1] M. Chtepen, F. H. Claeys, B. Dhoedt, F. De Turck, P. Demeester, and P. A. Vanrolleghem, “Adaptive task checkpointing and replication: toward efficient fault-tolerant grids,” IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, vol. 20, no. 2, pp. 180–190, 2009.
@article{669790,
  abstract     = {A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the abovementioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment Dynamic Scheduling in Distributed environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.},
  author       = {Chtepen, Maria and Claeys, Filip HA and Dhoedt, Bart and De Turck, Filip and Demeester, Piet and Vanrolleghem, Peter A},
  issn         = {1045-9219},
  journal      = {IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS},
  keywords     = {performance of systems,Distributed systems,fault tolerance,availability,PERFORMANCE,INTERVAL},
  language     = {eng},
  number       = {2},
  pages        = {180--190},
  title        = {Adaptive task checkpointing and replication: toward efficient fault-tolerant grids},
  url          = {http://dx.doi.org/10.1109/TPDS.2008.93},
  volume       = {20},
  year         = {2009},
}
