Exact and Approximate SMOTE Algorithms for Big Data in PySpark
Table of Contents
Introduction
Myself and two colleagues developed 3 SMOTE algorithms for handling imbalanced datasets in PySpark. We also developed an ENN algorithm, using the NearestNeighbors algorithms developed for SMOTE.
The algorithms are:
- Local SMOTE
- Approxmiate SMOTE
- Exact SMOTE
- Exact SMOTE with ENN
Method
Local SMOTE
This involved applying the SMOTE algorithm to each partition of the data, and then combining the results. This was the simplest algorithm to implement, but was also the slowest.
Approximate SMOTE
This involed using LSH and BucketedRandomProjection to generate a list of approximate nearest neighbors for each point in the minority class. From here, we can create synthetic data using a map and udf.
Exact SMOTE
The data structure used for Exact SMOTE was a KDTree. A KDTree can be generate in O(k log n) time, so the bottleneck lies in querying this KDTree. To solve this problem we simply broadcasted the KDTree to a mapPartitions function to allow all the worker nodes to query the tree. Once the nearest neighbors have been collated, we can synthetise new data in the same way as Approx-SMOTE.
Exact SMOTE with ENN
This is an ensemble method which uses ENN to remove noise from the synthetically created data. This is done by using the same KDTree as Exact SMOTE, but instead of using the nearest neighbors to create synthetic data, we take the sum of the target label of the nearest neighbors and remove any points that are missclassified by their nearest neighbors.
Key Technologies
- Python
- DataBricks