
Machine learning challenges with Imbalanced Data

In many real-world machine learning problems the data are imbalanced: one class is underrepresented relative to the others. Classifiers trained on such data tend to misclassify examples of the rare class as belonging to the common one. The cost of these misclassifications is often unknown at learning time and can be very high.

We often see imbalanced classification scenarios of this kind in fraud and intrusion detection, medical diagnosis and monitoring, bioinformatics, text categorization, and similar domains.

To better understand the problem, consider the “Mammography Data Set,” a collection of images acquired from a series of mammography examinations performed on a set of distinct patients. For such a data set, the natural classes are “Positive” and “Negative,” for an image of a “cancerous” or a “healthy” patient, respectively. From experience, one would expect the number of noncancerous patients to greatly exceed the number of cancerous patients; indeed, this data set contains 10,923 “Negative” (majority class) and 260 “Positive” (minority class) samples.

Ideally, we want a classifier that provides a balanced degree of predictive accuracy for both the minority and majority classes. With many standard learning algorithms, however, classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100% accuracy and the minority class having an accuracy of only 0 to 10%. Suppose a classifier achieves 5% accuracy on the minority class of the mammography data set. Only 13 of the 260 minority samples are then classified correctly, and the other 247 are misclassified as majority samples (i.e., 247 cancerous patients are diagnosed as noncancerous). In medicine, the consequences of such an error can be overwhelmingly costly, far more so than classifying a noncancerous patient as cancerous.

This also suggests that the conventional practice of evaluating with a single criterion, such as overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. In an extreme case, if a data set contains 1% minority class examples and 99% majority class examples, the naive approach of classifying every example as a majority class example yields an accuracy of 99%. Taken at face value, 99% accuracy across the entire data set appears superb; however, the figure fails to reflect that none of the minority examples are identified, when in many situations those minority examples are the ones of most interest.
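The arithmetic in the mammography example can be checked with a short Python sketch. It uses only the class counts quoted above. The function name `evaluate` and the assumption that every majority sample is classified correctly are illustrative choices, not part of the original data set description.

```python
# Class counts quoted for the mammography data set.
N_NEGATIVE = 10_923  # majority class
N_POSITIVE = 260     # minority class


def evaluate(true_negatives, true_positives):
    """Return overall accuracy and per-class recall from correct counts."""
    total = N_NEGATIVE + N_POSITIVE
    return {
        "accuracy": (true_negatives + true_positives) / total,
        "majority_recall": true_negatives / N_NEGATIVE,
        "minority_recall": true_positives / N_POSITIVE,
    }


# Scenario from the text: 5% accuracy on the minority class.
# Assumption for illustration: all majority samples are classified correctly.
correct_positives = round(0.05 * N_POSITIVE)        # 13 found
missed_positives = N_POSITIVE - correct_positives   # 247 missed
skewed = evaluate(N_NEGATIVE, correct_positives)

# Naive baseline: label every sample "Negative".
naive = evaluate(N_NEGATIVE, 0)

print(f"missed positives: {missed_positives}")
print(f"skewed classifier: {skewed}")
print(f"naive classifier:  {naive}")
```

Under these assumptions both classifiers score an overall accuracy of roughly 98%, while minority recall is 5% in one case and 0% in the other. Reporting the per-class figures alongside overall accuracy exposes the gap that the single number hides.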

A data set is considered imbalanced if the class of interest (the positive or minority class) is rare relative to the other classes (the negative or majority classes). As a result, the classifier can be heavily biased toward the majority class. Such data sets pose a challenging problem for data mining, because standard classification algorithms usually assume a balanced training set, and that assumption biases them toward the majority class.

A number of approaches, ranging from re-sampling the data set to dealing directly with the skewness of the data, have been developed to address class imbalance. But much of this research has focused on the methods themselves, without discussing exactly how or why they work or what underlying issues they address.
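As one illustration of the re-sampling end of that range, here is a minimal sketch of random oversampling in plain Python. The function `random_oversample` and the toy data are hypothetical and written for this example. This is only one simple re-sampling strategy, not a method the post itself prescribes.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (made up): 95 "Negative" and 5 "Positive" samples.
X = list(range(100))
y = ["Negative"] * 95 + ["Positive"] * 5

X_bal, y_bal = random_oversample(X, y)
counts = Counter(y_bal)
print(counts)
```

Oversampling of this kind only repeats existing minority samples and adds no new information. It is normally applied to the training split alone, so that evaluation still reflects the original class distribution.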

