Standardization and Data Augmentation in Genetic Programming.
- Source:
- IEEE Transactions on Evolutionary Computation; Dec 2022, Vol. 26, Issue 6, p1596-1608, 13p
- Publication Year:
- 2022
Abstract
- Genetic programming (GP) is a common method for performing symbolic regression that relies on the use of ephemeral random constants in order to adequately scale predictions. Suitable values for these constants must be drawn from appropriate, but typically unknown, distributions for the problem being modeled. While rarely used with GP, Z-score standardization of feature and response spaces often significantly improves the predictive performance of GP by removing scale issues and reducing error due to bias. However, in some cases it is also associated with erratic error due to variance. This article demonstrates that this variance component increases in the presence of gaps at the boundaries of the training data explanatory variable intervals. An initial solution to this problem is proposed that augments training data with pseudo instances located at the boundaries of the intervals. When applied to benchmark problems, particularly with small training samples, this solution reduces error due to variance and, therefore, total error. Augmentation is shown to also stabilize error in larger problems; however, results suggest that standardized GP works well on such problems with little need for training data augmentation. [ABSTRACT FROM AUTHOR]
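The two preprocessing ideas in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: Z-score standardization of the feature and response spaces, followed by appending pseudo instances at the min/max boundaries of each explanatory variable's interval. How the paper assigns responses to the pseudo instances is not stated in the abstract; here the response is copied from the nearest training instance, which is purely a simplifying assumption.

```python
import numpy as np

def zscore_standardize(X, y):
    """Column-wise Z-score standardization of features and response."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    return Xs, ys

def augment_boundaries(X, y):
    """Append one pseudo instance at each feature's lower and upper bound.

    Each pseudo instance sits at the interval boundary in one feature and
    at the feature means elsewhere; its response is copied from the nearest
    training instance (an assumption, not the paper's construction).
    """
    pseudo_X, pseudo_y = [], []
    for j in range(X.shape[1]):
        for bound in (X[:, j].min(), X[:, j].max()):
            x_new = X.mean(axis=0).copy()
            x_new[j] = bound
            nearest = np.argmin(np.linalg.norm(X - x_new, axis=1))
            pseudo_X.append(x_new)
            pseudo_y.append(y[nearest])
    return np.vstack([X] + pseudo_X), np.concatenate([y, pseudo_y])

# Toy symbolic-regression target (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(30, 2))
y = 2.0 * X[:, 0] ** 2 + X[:, 1]

Xs, ys = zscore_standardize(X, y)
Xa, ya = augment_boundaries(Xs, ys)
```

With 2 features the augmentation adds 4 pseudo instances (one per interval endpoint per feature), so the sketch trades a slightly larger training set for coverage of the interval boundaries where the abstract reports the variance problem arising.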
Details
- Language:
- English
- ISSN:
- 1089-778X
- Volume:
- 26
- Issue:
- 6
- Database:
- Complementary Index
- Journal:
- IEEE Transactions on Evolutionary Computation
- Publication Type:
- Academic Journal
- Accession number:
- 160688601
- Full Text:
- https://doi.org/10.1109/TEVC.2022.3160414