Column: Machine Learning Research Society
The Machine Learning Research Society is a student organization under the Peking University Innovation Center for Big Data and Machine Learning, aiming to build a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society also hosts talks by industry and academic leaders, salon-style sharing sessions, real-data innovation competitions, and other activities.

[Learning] Why fewer predictor variables may be better in machine learning models

Machine Learning Research Society · WeChat Official Account · AI · 2017-04-05 19:19

Summary
 

Reposted from: ArnetMiner

There are a few reasons why it might be a better idea to have fewer predictor variables rather than having many of them. Read on to find out more.


Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post.

Here are a few reasons why it might be a better idea to have fewer predictor variables rather than having many of them:


Redundancy/Irrelevance:

If you are dealing with many predictor variables, chances are high that there are hidden relationships between some of them, leading to redundancy. Unless you identify and handle this redundancy (by selecting only the non-redundant predictor variables) early in the data analysis, it can be a huge drag on your subsequent steps.
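As a minimal sketch of how such hidden relationships can be surfaced (not from the original post; the threshold of 0.9 and the synthetic data are illustrative assumptions), one simple approach is to flag pairs of predictors whose absolute pairwise correlation is high:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: x2 is (almost) a copy of x0, so it is redundant.
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + 0.01 * rng.normal(size=n)   # hidden linear relationship with x0
X = np.column_stack([x0, x1, x2])

def redundant_pairs(X, threshold=0.9):
    """Return index pairs (i, j), i < j, whose |correlation| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    p = X.shape[1]
    return [(i, j)
            for i in range(p)
            for j in range(i + 1, p)
            if abs(corr[i, j]) > threshold]

print(redundant_pairs(X))  # → [(0, 2)]
```

Pairwise correlation only catches linear pairwise redundancy; a feature can also be redundant given a *combination* of others, which would require richer checks (e.g. variance inflation factors).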


It is also likely that not all predictor variables have a considerable impact on the dependent variable(s). You should make sure that the set of predictor variables you select does not include any irrelevant ones – even if you know that the data model will take care of them by assigning them lower significance.

Note: Redundancy and irrelevance are two different notions – a relevant feature can be redundant due to the presence of other relevant feature(s).
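A crude way to screen for irrelevance (again a hedged sketch, not the original author's method; the cutoff of 0.2 is an arbitrary illustrative choice) is to score each predictor by its absolute correlation with the target and keep only those above a cutoff:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))
# Only features 0 and 2 drive the response; features 1 and 3 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def relevance_scores(X, y):
    """Absolute Pearson correlation of each column with the target."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])

scores = relevance_scores(X, y)
keep = np.where(scores > 0.2)[0]   # illustrative cutoff
print(keep)                        # features 0 and 2 survive the screen
```

Note that a univariate screen like this can miss features that matter only in combination with others, and it cannot detect redundancy – which is exactly why the two notions above are kept distinct.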


Overfitting:

Even when you have a large number of predictor variables with no relationships between any of them, it is still preferable to work with fewer predictors. Data models with a large number of predictors (also referred to as complex models) often suffer from overfitting, in which case the model performs great on training data but poorly on test data.
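The train/test gap described above can be demonstrated directly. The following sketch (illustrative assumptions: 60 training points, one truly relevant predictor, ordinary least squares) fits the same kind of model with 2 versus 50 predictors and compares training and test error:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, p, rng):
    """n samples, p predictors; only column 0 actually drives the response."""
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] + rng.normal(size=n)
    return X, y

def fit_and_score(p, n_train=60, n_test=1000):
    """Fit OLS on a small training set; return (train MSE, test MSE)."""
    Xtr, ytr = make_data(n_train, p, rng)
    Xte, yte = make_data(n_test, p, rng)
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    mse = lambda X, y: float(np.mean((X @ w - y) ** 2))
    return mse(Xtr, ytr), mse(Xte, yte)

train_small, test_small = fit_and_score(p=2)
train_big, test_big = fit_and_score(p=50)
print(f"p=2 : train {train_small:.2f}, test {test_small:.2f}")
print(f"p=50: train {train_big:.2f}, test {test_big:.2f}")
```

With 50 predictors and only 60 training points, the model fits the training noise (lower training error) while its test error inflates well beyond the irreducible noise level – the overfitting signature the paragraph describes.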


Link:

http://www.kdnuggets.com/2017/04/must-know-fewer-predictors-machine-learning-models.html


Original link:

http://weibo.com/1870858943/ED6cl2wBA?type=comment#_rnd1491389315572
