Optimization of the Join between Large Tables in the Spark Distributed Framework.

Authors :: Wu, Xiang
He, Yueshun
Source :: Applied Sciences (2076-3417); May2023, Vol. 13 Issue 10, p6257, 14p
Publication Year :: 2023
Abstract: The Join task between Spark large tables takes a long time to run and produces a lot of disk I/O, network I/O and disk occupation in the Shuffle process. This paper proposes a lightweight distributed data filtering model that combines broadcast variables and accumulators using RoaringBitmap. When the data in the two tables are not exactly matched, the dimension table Key is collected through the accumulator, compressed by RoaringBitmap and distributed to each node using broadcast variables. The distributed fact table data can be pre-filtered on the local server, which effectively reduces the data transmission and disk reading and writing in the Shuffle phase. Experimental results show that this optimization method can reduce disk usage, shorten the running time and reduce network I/O and disk I/O for Spark Join tasks in the case of massive data, and the effect is more obvious when the two tables have a higher incomplete matching degree or a fixed matching degree but a larger amount of data. This optimization scheme has the advantages of being easy to use, being easy to maintain and having an obvious effect, and it can be applied to many development scenarios. [ABSTRACT FROM AUTHOR]

Full Text Access

Tools