Research Area:  Cloud Computing
Scheduling big data processing workflows involves both large-scale tasks and transmission of massive intermediate data among tasks, thus optimizing their completion time and monetary cost becomes a challenging issue. Besides, data streams are continuously generated, and dynamically submitted to clouds for real-time or near real-time processing. Naturally, responsive schedules are required to keep pace with such dynamic environments and this further aggravates the difficulty of the workflow scheduling problem. To address these issues, we first derive two theorems to minimize the completion time of a set of parallel workflow tasks and the start time of each workflow task, and then define the latest finish time for workflow tasks, which is also proved its advantage in reducing costs without delaying the completion of workflows. On the basis of these theorems, we propose a novel real-time scheduling algorithm using task-duplication, RTSATD, such that minimizing both the completion time and monetary cost of processing big data workflows in clouds. The performance of RTSATD is analyzed by using both synthesized and real-world workflows. The experimental results demonstrate the superiority of the proposed algorithm with respect to completion time (up to 28.73 percent) and resource utilization (up to 46.31 percent) over two existing approaches.
Author(s) Name:  Huangke Chen; Jinming Wen; Witold Pedrycz and Guohua Wu
Journal name:   IEEE Transactions on Big Data
Publisher name:  IEEE
Volume Information:  Volume: 6, Issue: 1, March 1 2020, Page(s): 131 - 144
Paper Link:   https://ieeexplore.ieee.org/document/8485300