Research Area:  Cloud Computing
Resubmission and replication are two fundamental, widely used fault-tolerance techniques in distributed computing systems. The resubmission-based strategy uses resources more efficiently, while the replication-based strategy can reduce task completion time when faults occur. However, few studies combine these two techniques for fault-tolerant workflow scheduling, especially in Cloud systems. In this paper, we present a novel fault-tolerant workflow scheduling algorithm (ICFWS) for Cloud systems that combines the two strategies to exploit their respective advantages while trying to meet the soft deadline of the workflow. First, it divides the soft deadline of the workflow into sub-deadlines for all tasks. Then, it selects an appropriate fault-tolerant strategy and reserves a suitable resource for each task, taking into account the imbalance of sub-deadlines among tasks and the on-demand resource provisioning of Cloud systems. Finally, an online scheduling and reservation adjustment scheme selects a suitable resource for each task that uses the resubmission strategy and, during task execution, adjusts the sub-deadlines and fault-tolerant strategies of some unexecuted tasks. The proposed algorithm is evaluated on both real-world and randomly generated workflows. The results demonstrate that ICFWS outperforms some well-known approaches on the corresponding metrics.
Keywords:  
Author(s) Name:  Guangshun Yao; Yongsheng Ding; Kuangrong Hao
Journal name:  IEEE Transactions on Parallel and Distributed Systems
Conference name:  
Publisher name:  IEEE
DOI:  10.1109/TPDS.2017.2687923
Volume Information:  Volume: 28, Issue: 12, Dec. 1 2017, Page(s): 3671-3683
Paper Link:  https://ieeexplore.ieee.org/document/7887706