Sampling strategies for Imbalanced Learning

As discussed in my previous blog post, imbalanced data poses serious challenges in Machine Learning. One approach to combating this imbalance is to alter the training set to create a more balanced class distribution, so that the resulting sampled data set can be used with traditional data-mining algorithms. This can be achieved through...
  1. Under-sampling, where the size of the majority class is reduced using techniques such as reducing redundancy or removing boundary candidates.
  2. Over-sampling, where the size of the minority class is increased by adding more candidates to augment the data set.
  3. A hybrid approach, which combines over-sampling of the minority class with under-sampling of the majority class.
Each of these techniques is discussed below.

Random Over Sampling

In random over-sampling, the minority class instances are duplicated in the data set until a more balanced distribution is reached. As an illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 2 minority instances are duplicated repeatedly, at random, until the data set is balanced across the classes.
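To make the idea concrete, here is a minimal sketch in plain Python using the 98/2 toy data set from the illustration. The function name random_oversample and the "maj"/"min" labels are my own choices for this example, not part of any library.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Original rows of this class; duplicates are drawn from here only.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: 98 majority instances and 2 minority instances.
rows = [[float(i)] for i in range(100)]
labels = ["maj"] * 98 + ["min"] * 2

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Note that no new information is created: every added row is an exact copy of one of the two original minority rows.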

This mode of data set duplication leads to the problem of over-fitting: in the example above each minority instance ends up appearing 49 times, so a model can simply memorize those two points and generalize poorly. To overcome this problem, several techniques synthesize new minority instances from the attributes of existing ones instead of copying them.

Synthetic Minority Over-sampling technique (SMOTE)

In this technique, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. To create the new synthetic minority class instances, SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
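The interpolation step can be sketched in a few lines of plain Python. This is only an illustration of the idea under simple assumptions (numeric features, Euclidean distance, at least two minority instances), not the reference implementation. The name smote_sample and the toy points are hypothetical.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority instance by interpolating between
    a random minority instance a and one of its k nearest minority neighbors b."""
    rng = rng or random.Random(0)
    a = rng.choice(minority)
    # Candidate neighbors: every minority instance except a itself.
    others = [p for p in minority if p is not a]
    neighbors = sorted(others, key=lambda p: math.dist(a, p))[:k]
    b = rng.choice(neighbors)
    # Convex combination: a point on the line segment between a and b.
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

# Hypothetical minority class instances in a 2-D feature space.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(10)]
for point in synthetic:
    print(point)
```

Because each synthetic point lies on a segment between two real minority instances, the new points stay inside the region already occupied by the minority class.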

Widely used variants of SMOTE include Borderline-SMOTE and Safe-Level-SMOTE.

Clustering Method 

This method first clusters all of the minority class instances using k-means clustering. Each cluster is then over-sampled so that all clusters have the same number of instances and the overall data set is balanced. The purpose of this method is to identify the disparate regions in the feature space where minority class instances are found, and to ensure that each region is equally represented with minority class instances.

Focused Re-sampling

With this technique, only minority class instances that occur on the boundary between minority and majority class instances are over-sampled. In this way, redundant instances are reduced, and better performance can be achieved.

Random Under Sampling

In random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached. As an illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 98 majority instances are reduced at random until the data set is balanced across the classes.
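A minimal sketch of this in plain Python, again on the 98/2 toy data set. The function name random_undersample and the "maj"/"min" labels are assumptions made for this example.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class so that every class has
    as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Sample without replacement: nothing is duplicated, rows are only dropped.
        kept.extend(rng.sample(idx, target))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 98 majority instances and 2 minority instances.
rows = [[float(i)] for i in range(100)]
labels = ["maj"] * 98 + ["min"] * 2

small_rows, small_labels = random_undersample(rows, labels)
print(Counter(small_labels))
```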

This mode of data set reduction generally leads to loss of potentially useful information (in the example above, reaching full balance means discarding 96 of the 98 majority instances, nearly 98% of the majority class). To limit this loss, various techniques have been developed that avoid discarding useful information by reducing the majority class only through the removal of redundant, noisy and/or borderline instances.

Tomek Links and CNN

This is a combination of Tomek Links, which identify borderline and noisy instances, and Condensed Nearest Neighbor (CNN), which removes redundant instances.
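A Tomek link is a pair of instances from opposite classes that are each other's nearest neighbor. The sketch below finds such pairs with a brute-force search and drops the majority member of each pair. It covers only the Tomek link part, not CNN, and the function names and toy points are hypothetical.

```python
import math

def nearest(i, points):
    """Index of the nearest other point to points[i] (Euclidean distance)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j) with different labels that are mutual nearest neighbors."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links

# Hypothetical data: one minority point sitting right next to a majority point.
points = [[0.0, 0.0], [1.0, 0.0], [1.2, 0.0], [4.0, 0.0], [4.5, 0.0]]
labels = ["maj", "maj", "min", "maj", "maj"]

links = tomek_links(points, labels)
# Under-sampling: remove the majority member of every link.
drop = {i if labels[i] == "maj" else j for i, j in links}
cleaned = [(p, y) for k, (p, y) in enumerate(zip(points, labels)) if k not in drop]
print(links)
print(cleaned)
```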

Neighborhood Cleaning Rule

This is based on the Edited Nearest Neighbor (ENN) rule, wherein an instance of the majority class is removed from the data set if it is misclassified by its three nearest neighbors. In addition, when an instance of the minority class is misclassified by its three nearest neighbors, the majority class instances among those neighbors are removed.
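As a rough sketch, the ENN step for the majority class can be written as below: a majority instance is dropped when a majority vote of its three nearest neighbors assigns it to another class. This covers only that one step, uses a brute-force neighbor search, and the function name enn_undersample and the toy points are hypothetical.

```python
import math
from collections import Counter

def enn_undersample(points, labels, majority, k=3):
    """Return indices to keep: all non-majority instances, plus majority
    instances that their k nearest neighbors classify correctly."""
    keep = []
    for i, p in enumerate(points):
        if labels[i] != majority:
            keep.append(i)
            continue
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in order[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted == majority:
            keep.append(i)
    return keep

# Hypothetical data: a majority cluster, a minority cluster, and one
# majority point (index 4) lying inside the minority cluster.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
          [5.4, 5.4],
          [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["maj", "maj", "maj", "maj", "maj", "min", "min", "min"]

keep = enn_undersample(points, labels, majority="maj")
print(keep)
```

Only the majority point surrounded by minority instances is removed; the well-separated majority cluster is left intact.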

Hybrid Techniques

A hybrid approach can overcome the limitations of over-sampling and under-sampling by employing a combination of both, to make sure that neither too much information is lost nor too much over-fitting is introduced.

Hybrid techniques worth mentioning are SMOTE+Tomek and SMOTE+ENN, where SMOTE is used to over-sample the minority class while Tomek links or ENN are used to under-sample the majority class.

Ensemble Techniques

Alongside these traditional sampling techniques, we have ensemble-based sampling techniques, which promise a better sample of the data to reduce imbalance. Ensemble-based sampling techniques are discussed here.

The strategies mentioned above can help balance the training data set. But one big challenge with sampling strategies is deciding how much to sample, which obviously depends on the sampling strategy that is deployed. There are different approaches to help discover a suitable sampling strategy and the amount of sampling to be done. But in general, this is a difficult optimization problem and may prove impractical depending on the size of the data set and the level of imbalance.
