Two indispensable tools for Kaggle data-mining competitions: xgboost and model ensembling. Fusing several single models can lower bias and variance, control overfitting, and improve accuracy. The text below explains why ensembling has these effects and introduces several common ensembling methods: (weighted) voting, averaging, stacking, and blending.
Model ensembling is a very powerful technique to increase
accuracy on a variety of ML tasks. In this article I will share my
ensembling approaches for Kaggle Competitions.
The first part looks at creating ensembles from submission
files. The second part looks at creating ensembles through stacked
generalization/blending.
I explain why ensembling reduces the generalization error. Finally I
show different methods of ensembling, together with their results and
code so you can try them out for yourself.
"This is how you win ML competitions: you take other peoples’ work and ensemble them together." (Vitaly Kuznetsov, NIPS 2014)
Creating ensembles from submission files
The most basic and convenient way to ensemble is to ensemble Kaggle
submission CSV files. You only need the predictions on the test set for
these methods — no need to retrain a model. This makes it a quick way to
ensemble already existing model predictions, ideal when teaming up.
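As a quick illustration, here is a minimal sketch of averaging several submission files with pandas. The file names and the Id/Prediction column layout are assumptions for illustration only; real submissions use whatever columns the competition specifies. See the code link at the end for the author's own scripts.

import pandas as pd

# Hypothetical submission files; adjust "Id" and "Prediction"
# to match the column names your competition actually uses.
files = ["model_a.csv", "model_b.csv", "model_c.csv"]

subs = [pd.read_csv(f) for f in files]
ensemble = subs[0][["Id"]].copy()

# Simple (unweighted) average of the predicted values.
ensemble["Prediction"] = sum(s["Prediction"] for s in subs) / len(subs)
ensemble.to_csv("ensemble_average.csv", index=False)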
Voting ensembles.
We first take a look at a simple majority-vote ensemble. Let's see
why model ensembling reduces the error rate, and why it works better to
ensemble weakly correlated model predictions, as the sketch below illustrates.
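For intuition, consider three independent classifiers that are each correct 70% of the time. A majority vote is correct whenever at least two of them are right: 0.7^3 + 3 × 0.7^2 × 0.3 ≈ 78.4%, better than any single model. The short simulation below checks this number; the 0.7 accuracy and the three-model setup are illustrative assumptions, not results from the original guide.

import random

# Rough simulation of the majority-vote intuition: three
# independent classifiers, each correct 70% of the time.
# All numbers here are illustrative assumptions.
random.seed(0)
trials = 100_000
p_correct = 0.7

wins = 0
for _ in range(trials):
    votes = sum(random.random() < p_correct for _ in range(3))
    if votes >= 2:  # the majority of the three votes is correct
        wins += 1

print(wins / trials)  # ~0.784, better than any single 0.7 model

Note that this gain depends on the errors being independent: three highly correlated models tend to make the same mistakes, and the vote then adds little.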
Link:
https://mlwave.com/kaggle-ensembling-guide/
Code link:
https://github.com/MLWave/Kaggle-Ensemble-Guide
Original post link:
http://weibo.com/3983872447/EEqibcLw7?type=comment#_rnd1492246619714