Proactive Failure-Aware Task Scheduling Framework for Cloud Computing - 2021

Research Area:  Cloud Computing


Cloud computing is a widely adopted platform for executing end users' tasks across different application types. In the cloud, an application task is prone to failure for several reasons, such as a software bug or exception, or a virtual or physical infrastructure failure. Cloud service providers are responsible for managing the availability of scheduled computing tasks in order to provide a high level of QoS for their customers. Protecting tasks against failure is a challenging, nontrivial mission due to the dynamic, heterogeneous, and large distributed structure of the cloud environment. Existing works in the literature focus on task failure prediction and neglect the remedy (post) actions. In this work, we first study and analyze three publicly available large cluster datasets, from Google, Alibaba, and Trinity, to characterize task failure in cloud computing platforms. We then propose a failure-aware task scheduling framework that can predict the termination status of a given set of tasks at runtime and take the appropriate remedy actions. The framework uses two deep learning methods, Artificial Neural Network (ANN) and Convolutional Neural Network (CNN), for different prediction purposes. In addition, we formalize the action selection problem as an Integer Linear Programming (ILP) model and propose a heuristic optimization solution that aims to minimize the failure probability of tasks and their resource usage. The results show that ANN and CNN can achieve prediction accuracy of up to 94% and 92%, respectively, using the Google dataset. Moreover, the framework can protect up to 40% of the tasks that are predicted as failed using the Alibaba dataset by taking the appropriate remedy actions, and hence save many cluster resources such as CPU and RAM.
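To make the action selection idea in the abstract more concrete, the following is a minimal illustrative sketch of a greedy heuristic that trades off failure probability against resource usage under a budget. It is not the paper's ILP model or its heuristic. The `Action` fields, the remedy names ("migrate", "replicate"), the normalized cost scale, the `weight` parameter, and the scoring rule are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    fail_prob: float  # assumed failure probability after applying the action
    cost: float       # assumed normalized resource cost (e.g. CPU + RAM)


def select_actions(tasks, budget, weight=0.5):
    """Greedy sketch: visit tasks from highest to lowest predicted failure
    probability and pick, for each, the affordable option with the lowest
    score = weight * fail_prob + (1 - weight) * cost.

    tasks maps task id -> (predicted_fail_prob, [Action, ...]).
    Doing nothing is always allowed and costs nothing.
    """
    chosen = {}
    remaining = budget
    ordered = sorted(tasks.items(), key=lambda item: item[1][0], reverse=True)
    for task_id, (predicted, actions) in ordered:
        # "no-op" keeps the task as scheduled, with its predicted failure risk.
        candidates = [Action("no-op", predicted, 0.0)]
        candidates += [a for a in actions if a.cost <= remaining]
        best = min(
            candidates,
            key=lambda a: weight * a.fail_prob + (1 - weight) * a.cost,
        )
        chosen[task_id] = best.name
        remaining -= best.cost
    return chosen


# Hypothetical example data, not taken from any of the datasets.
example_tasks = {
    "t1": (0.9, [Action("migrate", 0.2, 0.3), Action("replicate", 0.1, 0.6)]),
    "t2": (0.3, [Action("migrate", 0.1, 0.3)]),
}
print(select_actions(example_tasks, budget=0.5))
```

In this toy run, the high-risk task `t1` is handled first and uses most of the budget, so `t2` is left as scheduled. An exact ILP solver could find a better global assignment; the greedy pass only illustrates the kind of trade-off such a heuristic makes.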


Author(s) Name:  Yanal Alahmad; Tariq Daradkeh; Anjali Agarwal

Journal name:  IEEE Access

Conference name:  

Publisher name:  IEEE

DOI:  10.1109/ACCESS.2021.3101147

Volume Information:  Volume: 9, Page(s): 106152-106168