Thursday, October 6, 2011

Sampling strategies for Imbalanced Learning

As discussed in my previous blog, Imbalanced data poses serious challenges in Machine Learning.  One of approach to combat this imbalance is data is to alter the training set in such a way as to create a more balanced class distribution so that the resulting sampled data set can be used with traditional data-mining algorithms. This can be achieved through... 
  1. Under-sample where the size of the majority class is reduced using different techniques like reducing redundancy, removing boundary candidates etc.,
  2. Over-sample where the size of the minority class is increased by adding more candidates which can augment the data set.
  3. Hybrid approach where a combination of both oversampling of minority class and under sampling of majority class is attempted.
Each of these techniques discussed below

Random Over Sampling

In random over-sampling, the minority class instances are duplicated in the data set until a more balanced distribution is reached. As a illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. In this 2 minority instances are duplicated repeated, at random, so that complete data set can be balanced across the classes.

This mode of data set duplication leads to the problem of over-fitting (each instance is minority class is copied 49 times leads to lots of generalization). To overcome this problem different techniques of synthesizing data instance based on different attributes

Synthetic Minority Over-sampling technique (SMOTE)

In this technique, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. To create the new synthetic minority class instances, SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

Some of the variants of SMOTE heavily employed includes Borderline SMOTE, Safe-level SMOTE

Clustering Method 

First cluster all of the minority class instances using k-means clustering. They then over sample each of the clusters to have the same number of instances, and the overall dataset to be balanced. The purpose of this method is to identify the disparate regions in the feature space where minority
class instances are found and to ensure that each region is equally represented with minority class instances

Focused Re-sampling

With this technique, only minority class instances that occur on the boundary between minority and majority class instances are over-sampled. In this way, redundant instances are reduced, and better performance can be achieved.

Random Under Sampling

In random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached. As a illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. In this 98 majority instances are reduced at random so that complete data set can be balanced across the classes.

This mode of data set reduction generally leads to loss of potentially useful information (nearly 82% of the data from majority class is lost). To overcome these issues various techniques are developed to avoid using useful information by reducing the majority class through redundant noise and / or borderline.

Tomek Links and CNN

This is combination of Tomek Links, which are borderline and noisy instances, and Condensed Nearest Neighbor (CNN), removes redundant instances.

Neighborhood Cleaning Rule

This is based on Edited Nearest Neighbor rule (ENN) where in an instance of the majority class is removed from the dataset if it is misclassified by its three nearest neighbors in the majority class or misclassified by three nearest neighbors in minority class.

Hybrid Techniques

A hybrid approach can over the limitation of over-sampling and under-sampling by employing a combination of both over-sampling and under-sampling to make sure neither too much of information is lost nor too much of over-fitting.

Some of the hybrid techniques to mention, SMOTE+Tomek and SMOTE+ENN where SMOTE is used to over-sample the minority class while Tomek/ENN is using to under-sample the majority class

Ensemble Techniques

Along side these traditional sampling techniques we have ensemble based sampling techniques which promise better sample of data to reduce imbalance. Ensemble based sampling techniques discussed here

Above mention strategies can help balance the training data set. But one big challenge with sampling strategies is deciding how much to sample, which is obviously conditioned on the sampling strategy that is deployed. We have different approaches to help discover the sample sampling strategy and amount of sampling to be done. But in general, this is a difficult optimization problem and may prove impractical in practice depending on the size of the dataset and level of imbalance.

No comments:

Post a Comment