This blog post introduces seven techniques for handling extremely imbalanced datasets, which are common in domains like intrusion detection and real-time bidding.
Introduction
What do datasets in domains like fraud detection in banking, real-time bidding in marketing, or intrusion detection in networks have in common?
Data used in these areas often contain less than 1% of rare but “interesting” events (e.g. fraudsters using credit cards, users clicking an advertisement, or a corrupted server scanning its network). However, most machine learning algorithms do not work well with imbalanced datasets. The following seven techniques can help you train a classifier to detect the abnormal class.
1. Use the right evaluation metrics
Applying inappropriate evaluation metrics to a model generated from imbalanced data can be dangerous. Imagine a training set in which 99.8% of the samples belong to class “0”. If accuracy is used to measure the goodness of a model, a model that classifies all testing samples as “0” will have an excellent accuracy (99.8%), but obviously this model won’t provide any valuable information for us.
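To see the trap in code, here is a minimal sketch; the 998/2 split is a made-up illustration matching the 99.8% figure above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical test set: 998 samples of the majority class "0" and
# 2 samples of the rare class "1" (99.8% vs 0.2%).
y_true = np.array([0] * 998 + [1] * 2)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts "0"

print(accuracy_score(y_true, y_pred))  # 0.998 -- high accuracy, zero insight
```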
In this case, alternative evaluation metrics can be applied, such as the following (a sketch of computing them follows the list):
Precision (positive predictive value): the fraction of selected instances that are relevant.
Recall/Sensitivity: the fraction of relevant instances that are selected.
F1 score: the harmonic mean of precision and recall.
MCC: the Matthews correlation coefficient between the observed and predicted binary classifications.
AUC: the area under the ROC curve, which plots the true-positive rate against the false-positive rate.
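A minimal sketch of computing these metrics with scikit-learn; y_true, y_pred, and y_score are made-up stand-ins for a real test set and model output (y_score, the predicted positive-class probability, is what AUC requires):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Hypothetical labels, hard predictions, and positive-class scores.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.6, 0.9, 0.4]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))  # uses scores, not labels
```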
2. Resample the training set
Apart from using different evaluation criteria, one can also work on getting a different dataset. Two approaches to making a balanced dataset out of an imbalanced one are under-sampling and over-sampling.
2.1. Under-sampling
Under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples of the rare class and randomly selecting an equal number of samples from the abundant class, a balanced new dataset can be retrieved for further modelling.
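A minimal sketch of random under-sampling in NumPy, assuming X and y are NumPy arrays and the rare class is labelled 1 (both are assumptions for illustration, not part of the original post):

```python
import numpy as np

def random_undersample(X, y, rare_label=1, seed=0):
    """Keep all rare-class samples and an equal-size random
    subset of the abundant class."""
    rng = np.random.default_rng(seed)
    rare_idx = np.flatnonzero(y == rare_label)
    abundant_idx = np.flatnonzero(y != rare_label)
    # Draw as many abundant samples as there are rare samples.
    keep = rng.choice(abundant_idx, size=len(rare_idx), replace=False)
    idx = np.concatenate([rare_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```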
2.2. Over-sampling
On the contrary, over-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated using e.g. repetition, bootstrapping or SMOTE (Synthetic Minority Over-sampling Technique) [1].
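A minimal sketch of over-sampling with SMOTE, assuming the third-party imbalanced-learn package is installed; the toy dataset is a stand-in for real data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package

# Toy dataset with roughly 99% majority class.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority samples by interpolating between
# a minority sample and its nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```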
Note that there is no absolute advantage of one resampling method over the other. Which of the two to apply depends on the use case and on the dataset itself. A combination of over- and under-sampling is often successful as well, as sketched below.
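As one example of such a combination, imbalanced-learn's SMOTETomek first over-samples with SMOTE and then removes Tomek links as an under-sampling clean-up step (a sketch under the same toy-data assumption as above):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek  # SMOTE + Tomek-link cleaning

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```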
References:
[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, 16:321–357, 2002.

Link:
http://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
Original post:
http://m.weibo.cn/1402400261/4114562032238642