Classification in Class Imbalanced Datasets

In this thesis we study the classification task in the presence of class imbalanced data. This task arises in many applications where the classes of interest are the under-represented (minority) classes. Examples include fraud detection, medical diagnosis and monitoring, text categorization, risk management, and information retrieval and filtering. Although many standard approaches to the classification task exist, most of them generalise poorly on the minority class.

This thesis studies well-known approaches to the classification problem in the presence of class imbalanced data, such as Cost-Sensitivity, Bagging for Imbalanced Datasets, MetaCost and SMOTE. The main contribution of the thesis is a new approach to the problem that we call Minority-Class Instance Generation through Feature Bagging. The approach is generative: it creates new instances of the minority class by sampling values of each feature present in the training data. Experiments show the superiority of our approach on four UCI datasets and a medical dataset provided by KULeuven.
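
The instance-generation idea described above can be illustrated with a minimal sketch. It is not the implementation from the thesis: it assumes that each feature value of a new instance is drawn independently and uniformly from the values that feature takes among the minority-class training instances, and the function name `generate_minority_instances` and its parameters are hypothetical.

```python
import random


def generate_minority_instances(minority_rows, n_new, seed=0):
    """Create n_new synthetic minority-class rows.

    Assumption (not taken from the thesis): every feature is sampled
    independently from the values observed for that feature in
    minority_rows.
    """
    rng = random.Random(seed)
    n_features = len(minority_rows[0])
    # Column-wise view: the observed values of each feature.
    columns = [[row[j] for row in minority_rows] for j in range(n_features)]
    # Build each new row by picking one observed value per feature.
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_new)]


# Toy minority class with one numeric and one nominal feature.
minority = [(1.0, "a"), (2.0, "b"), (3.0, "a")]
synthetic = generate_minority_instances(minority, n_new=5)
```

Because features are sampled independently in this sketch, a generated row may combine values that never occur together in the training data, for example `(3.0, "b")`.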

Keywords: Data Mining, Classification, Imbalanced, Unbalanced, Sampling, Random Effects, Randomization, Voting, Cost-sensitivity, Naive Bayes, General Practice

 

  1. Introduction
  2. Classification
    1. Classification Task
    2. Example of a classifier: C4.5
    3. Evaluation
    4. Conclusion
    5. References
  3. Classification in Class Imbalanced Datasets
    1. Problem Description
    2. Class Imbalance, Noise and Outliers
    3. Approaches
    4. Conclusion
    5. References
  4. Novel approach to Classification in Class Imbalanced Datasets
    1. Introduction
    2. Approach
    3. Implementation