An Intermediate Data Partition Algorithm - Python Projects

Research Area: Big Data

Abstract:

In the parallel computing framework of Hadoop/Spark, data skew is a common problem resulting in performance degradation, such as prolonging of the entire execution time and idle resources. What lies behind this issue is the partition imbalance, that causes significant differences in the amount of data processed by each reduce task. This paper proposes a key reassigning and splitting partition algorithm (SKRSP) to solve the partition skew from the source codes of Spark-core_2.11 project, which considers both the partition balance of the intermediate data and the partition balance after shuffle operators. First, we propose a step-based algorithm for sampling the input data to estimate the general key distribution of entire intermediate data. According to the types of the specific applications, we design two algorithms: hash based key reassigning algorithm (KRHP) and rang based key splitting algorithm (KSRP), which can generate appropriate strategy and implement the skew mitigation in shuffle phase. KKSRP generates the weighted bounds to split intermediate data for the type of sort-based applications while KRHP records these reassigned keys and the new reducers these keys belong to for other applications. Finally, we implement SKRSP in Spark 2.2.0 and evaluate its performance through four benchmarks exhibiting significant data skew: WordCount, Sort, Join, and PageRank. The experimental results verify that our algorithm not only can achieve a better partition balance but also reduce the execution time of reduce tasks effectively.

Keywords:

Author(s) Name: Zhuo Tang,Wei Lv,Kenli Li and Keqin Li

Journal name: IEEE Transactions on Cloud Computing

Conferrence name:

Publisher name: IEEE

DOI: 10.1109/TCC.2018.2878838

Volume Information: April-June 2021, pp. 461-474, vol. 9

Paper Link: https://www.computer.org/csdl/journal/cc/2021/02/08516368/14PnyHVF0xq

Office Address

Social List

An Intermediate Data Partition Algorithm for Skew Mitigation in Spark Computing Environment - 2018

Abstract:

S-Logix (OPC) Private Limited

Office Address

An Intermediate Data Partition Algorithm for Skew Mitigation in Spark Computing Environment - 2018

Abstract:

Related Papers