1. PigReuse: A Reuse-based Optimizer for Pig Latin
- Author
Camacho-Rodríguez, Jesús; Colazzo, Dario; Herschel, Melanie; Manolescu, Ioana; Roy Chowdhury, Soudip
Affiliations: Hortonworks Inc.; Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision (LAMSADE); Université Paris Dauphine-PSL; Université Paris sciences et lettres (PSL); Centre National de la Recherche Scientifique (CNRS); Institute of Parallel and Distributed Systems [Stuttgart] (IPVS); Rich Data Analytics at Cloud Scale (CEDAR); Laboratoire d'informatique de l'École polytechnique [Palaiseau] (LIX); École polytechnique (X); Inria Saclay - Ile de France; Institut National de Recherche en Informatique et en Automatique (Inria); Fractal Analytics Inc.
- Subjects
PigLatin, [INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB], ACM: H.: Information Systems/H.2: DATABASE MANAGEMENT/H.2.4: Systems/H.2.4.5: Query processing, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Reuse-based Optimization, Linear Programming, ACM: H.: Information Systems/H.2: DATABASE MANAGEMENT/H.2.4: Systems
- Abstract
Pig Latin is a popular language widely used for parallel processing of massive data sets. Currently, subexpressions that occur repeatedly in Pig Latin scripts are executed as many times as they appear, and the current Pig Latin optimizer does not identify reuse opportunities. We present a novel optimization approach that identifies and reuses repeated subexpressions in Pig Latin scripts. Our optimization algorithm, named PigReuse, operates on a particular algebraic representation of Pig Latin scripts. PigReuse identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and reuses their results as needed to compute exactly the same output as the original scripts. Our experiments demonstrate the effectiveness of our approach.
- Published
2016