Research Area:  Cloud Computing
A common performance problem in large-scale cloud systems is dealing with straggler tasks that are slow running instances which increase the overall response time. Such tasks impact the systems QoS and the SLA. There is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore volatile task characteristics. We propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. START analyzes all tasks and hosts based on compute and network resource consumption using an Encoder LSTM network to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START and compare it with IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler in terms of QoS parameters. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13%, 11%, 16%, 19%, compared to the state-of-the-art.
Author(s) Name:  Shreshth Tuli; Sukhpal Singh Gill; Peter Garraghan; Rajkumar Buyya; Giuliano Casale; Nick Jennings
Journal name:  IEEE Transactions on Services Computing
Publisher name:  IEEE
Volume Information:   Page(s): 1 - 1
Paper Link:   https://ieeexplore.ieee.org/document/9625828