There are a few reasons why it might be better to have fewer predictor variables rather than many. Read on to find out more.
By Anmol Rajpurohit.
Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post.
Here are a few reasons why it might be better to have fewer predictor variables rather than many:
Redundancy/Irrelevance:
If you are dealing with many predictor variables, chances are high that there are hidden relationships between some of them, leading to redundancy. Unless you identify and handle this redundancy (by selecting only the non-redundant predictor variables) early in the data analysis, it can be a huge drag on your subsequent steps.
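As a rough illustration, the sketch below flags candidate redundant predictors by their pairwise correlations. The synthetic DataFrame and the 0.9 cutoff are hypothetical choices made for this example; a real pipeline might instead rely on variance inflation factors, PCA, or domain knowledge.

```python
# A minimal sketch of flagging redundant predictors via pairwise correlation.
# The synthetic data and the 0.9 threshold are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = X["x1"] * 0.95 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1

corr = X.corr().abs()
# Look only at the upper triangle so each feature pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Redundant candidates:", redundant)  # expected: ['x3']
X_reduced = X.drop(columns=redundant)      # keep only non-redundant predictors
```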
It is also likely that not all predictor variables have a considerable impact on the dependent variable(s). You should make sure that the set of predictor variables you select does not include irrelevant ones, even if you know that the data model will take care of them by assigning them lower significance.
Note: Redundancy and irrelevance are two different notions; a relevant feature can be redundant due to the presence of other relevant feature(s).
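One simple way to screen for irrelevant predictors is to estimate how much each one tells you about the target, for example with mutual information. The sketch below uses synthetic data and an arbitrary cutoff; it is only an illustration of the idea, not the method from the original post.

```python
# A minimal sketch of screening out irrelevant predictors with mutual information.
# The synthetic data and the 0.01 cutoff are illustrative assumptions only.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
relevant = rng.normal(size=(n, 2))    # these two columns drive the target
irrelevant = rng.normal(size=(n, 3))  # pure noise columns
X = np.hstack([relevant, irrelevant])
y = 2.0 * relevant[:, 0] - relevant[:, 1] + rng.normal(scale=0.5, size=n)

mi = mutual_info_regression(X, y, random_state=0)
keep = mi > 0.01                      # crude relevance cutoff
print("MI per feature:", np.round(mi, 3))
print("Columns kept:", np.flatnonzero(keep))  # typically the first two
```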
Overfitting:
Even when you have a large number of predictor variables with no relationships between any of them, it is still preferable to work with fewer predictors. Data models with a large number of predictors (also referred to as complex models) often suffer from overfitting, in which case the model performs well on the training data but poorly on the test data.
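The gap between training and test performance is easy to reproduce on synthetic data. The sketch below fits the same ordinary linear regression once with many noise predictors and once with only the informative ones; the data sizes, seed, and split are arbitrary assumptions made for illustration.

```python
# A hedged illustration of overfitting: the same linear model fit on many noisy
# predictors vs. only the informative ones. The widening train/test gap is the point.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_noise = 120, 100
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, n_noise))
y = 3.0 * informative[:, 0] - 2.0 * informative[:, 1] + rng.normal(scale=1.0, size=n)

X_many = np.hstack([informative, noise])  # 102 predictors, mostly noise
X_few = informative                       # 2 informative predictors

for name, X in [("many predictors", X_many), ("few predictors", X_few)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"{name}: train R^2 = {model.score(X_tr, y_tr):.2f}, "
          f"test R^2 = {model.score(X_te, y_te):.2f}")
```

With more predictors than training samples, the model can fit the training data almost perfectly while generalizing poorly; with only the informative predictors, the train and test scores stay close.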
Link:
http://www.kdnuggets.com/2017/04/must-know-fewer-predictors-machine-learning-models.html