We investigate the problem of automatically finding an optimal sequence of operators that should be applied to a dataset to remediate the challenges in the data, found in the context of building a supervised machine learning model. We motivate the need for this problem and propose a novel model agnostic reinforcement learning based framework that automatically finds an optimal sequence. The search for an optimal sequence is guided by the need to improve the overall data quality as well as update the dataset in a way so that its boundary complexity is minimized. We also propose novel improvements in the operators used to detect and remediate data quality issues. We show the effectiveness of our approach through extensive experimentation performed on a large number of open-source datasets.
Cite as :
Zhengying Liu, Zhen Xu, Shangeth Rajaa, Meysam Madadi, Julio C. S. Jacques Junior, Sergio Escalera, Adrien Pavao, Sebastien Treguer, Wei-Wei Tu, Isabelle Guyon ; Proceedings of the NeurIPS 2019 Competition and Demonstration Track, PMLR 123:242-252, 2020.