
Sampling strategies for Imbalanced Learning

As discussed in my previous blog, imbalanced data poses serious challenges in Machine Learning. One approach to combat this imbalance is to alter the training set so as to create a more balanced class distribution, so that the resulting sampled data set can be used with traditional data-mining algorithms. This can be achieved through:
  1. Under-sampling, where the size of the majority class is reduced using techniques such as removing redundant or borderline instances.
  2. Over-sampling, where the size of the minority class is increased by adding new instances that augment the data set.
  3. A hybrid approach, where over-sampling of the minority class is combined with under-sampling of the majority class.
Each of these techniques is discussed below.

Random Over Sampling

In random over-sampling, minority class instances are duplicated in the data set until a more balanced distribution is reached. As an illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 2 minority instances are duplicated repeatedly, at random, until the data set is balanced across the classes.
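
A minimal sketch of random over-sampling in plain NumPy (the function name and interface are purely illustrative; libraries such as imbalanced-learn ship a ready-made RandomOverSampler):

```python
import numpy as np

def random_oversample(X, y, minority_label, random_state=0):
    """Duplicate minority instances at random until both classes are the same size.

    A minimal sketch; X is a 2-D feature array and y a 1-D label array.
    """
    rng = np.random.default_rng(random_state)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    n_needed = len(majority_idx) - len(minority_idx)
    # Sample minority indices with replacement to make up the difference.
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    X_res = np.vstack([X, X[extra]])
    y_res = np.concatenate([y, y[extra]])
    return X_res, y_res

# Example: 98 majority (label 0) vs 2 minority (label 1) instances.
X = np.random.randn(100, 3)
y = np.array([0] * 98 + [1] * 2)
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(np.bincount(y_bal))   # [98 98]
```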

This mode of duplication leads to over-fitting: in the example, each of the two minority instances is copied roughly 49 times, so the model tends to memorize those instances rather than generalize. To overcome this problem, different techniques synthesize new minority instances based on the attributes of existing ones; these are discussed next.

Synthetic Minority Over-sampling technique (SMOTE)

In this technique, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. To create the new synthetic minority class instances, SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
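
The core of SMOTE can be sketched in a few lines of NumPy. This is an illustrative toy version, not the reference implementation; in practice one would normally use an existing implementation such as the SMOTE class in imbalanced-learn:

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, random_state=0):
    """Generate synthetic minority instances by interpolating between a
    minority instance a and one of its k nearest minority neighbors b.

    A minimal sketch; X_min holds only the minority class instances and
    must contain at least two of them.
    """
    rng = np.random.default_rng(random_state)
    n = len(X_min)
    k = min(k, n - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                       # pick minority instance a at random
        a = X_min[i]
        # k nearest minority neighbors of a (excluding a itself)
        dists = np.linalg.norm(X_min - a, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        b = X_min[rng.choice(neighbors)]          # pick neighbor b at random
        gap = rng.random()                        # convex combination a + gap * (b - a)
        synthetic.append(a + gap * (b - a))
    return np.array(synthetic)
```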

Some heavily employed variants of SMOTE include Borderline-SMOTE and Safe-Level-SMOTE.

Clustering Method 

First, all of the minority class instances are clustered using k-means. Each cluster is then over-sampled so that all clusters have the same number of instances and the overall data set becomes balanced. The purpose of this method is to identify the disparate regions of the feature space where minority class instances are found and to ensure that each region is equally represented with minority class instances.
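
A sketch of this idea, assuming scikit-learn's KMeans for the clustering step (the number of clusters is a made-up choice). It balances the minority clusters against each other; a further over-sampling step would then bring the minority class as a whole up to the majority size:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample_minority(X_min, n_clusters=3, random_state=0):
    """Cluster the minority instances with k-means, then randomly over-sample
    each cluster to the size of the largest cluster so that every minority
    region is equally represented. A sketch with illustrative defaults.
    """
    rng = np.random.default_rng(random_state)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_min)
    target = max(np.sum(labels == c) for c in range(n_clusters))
    parts = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Draw additional instances (with replacement) from this cluster.
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        parts.append(X_min[np.concatenate([idx, extra])])
    return np.vstack(parts)
```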

Focused Re-sampling

With this technique, only the minority class instances that lie on the boundary between the minority and majority classes are over-sampled. In this way, redundant instances are avoided and better performance can be achieved.
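
One simple way to find such boundary instances is to flag every minority instance that has at least one majority instance among its k nearest neighbors, and then over-sample only those. A rough sketch under that assumption (parameters are illustrative):

```python
import numpy as np

def borderline_minority_indices(X, y, minority_label, k=5):
    """Return indices of minority instances that lie on the class boundary,
    i.e. those with at least one majority instance among their k nearest
    neighbors. Only these would then be over-sampled. A sketch.
    """
    minority_idx = np.where(y == minority_label)[0]
    borderline = []
    for i in minority_idx:
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # exclude the instance itself
        if np.any(y[neighbors] != minority_label):
            borderline.append(i)
    return np.array(borderline)
```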

Random Under Sampling

In random under-sampling, majority class instances are discarded at random until a more balanced distribution is reached. As an illustration, consider again a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 98 majority instances are discarded at random until the data set is balanced across the classes.
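
A minimal NumPy sketch, mirroring the over-sampling example above (again, imbalanced-learn offers a ready-made RandomUnderSampler):

```python
import numpy as np

def random_undersample(X, y, minority_label, random_state=0):
    """Discard majority instances at random until both classes are the same
    size. A minimal sketch with an illustrative interface.
    """
    rng = np.random.default_rng(random_state)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    # Keep only as many majority instances as there are minority instances.
    keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([keep, minority_idx])
    return X[idx], y[idx]

# With the 98-vs-2 example, only 2 of the 98 majority instances are kept.
```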

This mode of data set reduction generally leads to loss of potentially useful information (in the example above, 96 of the 98 majority instances, i.e. 96% of the data set, are thrown away). To overcome this, various techniques reduce the majority class more selectively, removing redundant, noisy or borderline instances rather than discarding instances at random.

Tomek Links and CNN

This is a combination of Tomek Links, which removes borderline and noisy instances, and Condensed Nearest Neighbor (CNN), which removes redundant instances.
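
A Tomek link is a pair of opposite-class instances that are each other's nearest neighbor; removing the majority member of each such pair cleans up the borderline and noisy region. A rough sketch of just the Tomek-link step (the CNN step, which removes redundant instances, is omitted):

```python
import numpy as np

def tomek_link_majority_indices(X, y, minority_label):
    """Find Tomek links (mutual nearest neighbors with opposite labels) and
    return the majority-class members of those pairs, i.e. the borderline
    and noisy instances to remove. A sketch; O(n^2) distances, small data only.
    """
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    nn = np.argmin(dists, axis=1)            # nearest neighbor of each instance
    to_remove = set()
    for i in range(n):
        j = nn[i]
        # Mutual nearest neighbors with opposite class labels form a Tomek link.
        if nn[j] == i and y[i] != y[j]:
            if y[i] != minority_label:
                to_remove.add(i)
            if y[j] != minority_label:
                to_remove.add(j)
    return np.array(sorted(to_remove))
```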

Neighborhood Cleaning Rule

This is based on the Edited Nearest Neighbor (ENN) rule: a majority class instance is removed from the data set if it is misclassified by its three nearest neighbors. In addition, when a minority class instance is misclassified by its three nearest neighbors, the majority class instances among those neighbors are removed.
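
A short usage sketch, assuming the imbalanced-learn package is available (its NeighbourhoodCleaningRule class implements this rule); the data set here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule

# Hypothetical imbalanced data set: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

ncr = NeighbourhoodCleaningRule(n_neighbors=3)
X_res, y_res = ncr.fit_resample(X, y)
```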

Hybrid Techniques

A hybrid approach can overcome the limitations of over-sampling and under-sampling by employing a combination of both, making sure that neither too much information is lost nor too much over-fitting is introduced.

Some hybrid techniques worth mentioning are SMOTE+Tomek and SMOTE+ENN, where SMOTE is used to over-sample the minority class while Tomek links / ENN are used to under-sample (clean) the majority class.
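
A usage sketch of both combinations, again assuming imbalanced-learn and a purely illustrative synthetic data set:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# SMOTE over-samples the minority class; Tomek links or ENN then clean
# the resulting data set by removing borderline/noisy majority instances.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
```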

Ensemble Techniques

Alongside these traditional sampling techniques, there are ensemble-based sampling techniques which promise better samples for reducing imbalance; these are discussed in a separate post.

The strategies mentioned above can help balance the training data set. But one big challenge with sampling strategies is deciding how much to sample, which obviously depends on the sampling strategy that is deployed. There are different approaches to help discover the right sampling strategy and the amount of sampling to be done. But in general, this is a difficult optimization problem and may prove impractical depending on the size of the data set and the level of imbalance.
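
One practical, if brute-force, way to approach this optimization is to treat the amount of sampling as just another hyper-parameter and tune it with cross-validation. A sketch, assuming imbalanced-learn's Pipeline and SMOTE together with scikit-learn's GridSearchCV (the classifier, the candidate ratios and the scoring metric are all illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # imblearn's Pipeline allows samplers as steps

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", DecisionTreeClassifier(random_state=0))])

# Treat the amount of over-sampling (minority/majority ratio after SMOTE)
# as a hyper-parameter and search over it with cross-validation.
grid = GridSearchCV(pipe,
                    param_grid={"smote__sampling_strategy": [0.2, 0.5, 1.0]},
                    scoring="f1",
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```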
