专栏名称: 数据分析1480
积土成山,积水成渊!!定期与大家分享数据分析和挖掘方面的干货,包括分析工具R与Python的使用、数据分析的案例、及最新的数据领域资讯。
目录
相关文章推荐
上海本地宝  ·  外地户口也能申请!上海公租房房源上新! ·  3 天前  
春城晚报  ·  凌晨通报:29人失联!原因公布 ·  2 天前  
云南气象  ·  2月10-11日云南西北部有雨雪天气 ... ·  3 天前  
云南新闻网  ·  【行摄云南】“郁”见春城浪漫 ·  4 天前  
51好读  ›  专栏  ›  数据分析1480

清洗数据,我习惯用这 7 步!

数据分析1480  · 公众号  ·  · 2020-04-03 09:10

正文

数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:

  • Step1 : read csv
  • Step2 : preview data
  • Step3: check null value for every column
  • Step4: complete null value
  • Step5: feature engineering
    • Step 5.1: delete some features
    • Step 5.2: create new feature
  • Step6: encode for categories columns
    • Step 6.1: Sklearn LabelEncode
    • Step 6.2: Pandas get_dummies
  • Step 7: check for data cleaning

今天使用泰坦尼克数据集,完整介绍以上 7 步的具体操作过程。

1 读入数据

这不废话吗,第一步就是读入数据。

data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw

结果:

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
88688702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ
891 rows × 12 columns

2 数据预览

data_raw.info()
data_raw.describe(include='all')

结果:

<class 'pandas.core.frame.DataFrame'>
RangeIndex:
891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
count891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889
uniqueNaNNaNNaN8912NaNNaNNaN681NaN1473
topNaNNaNNaNHakkarainen, Mr. Pekka PietarimaleNaNNaNNaN1601NaNG6S
freqNaNNaNNaN1577NaNNaNNaN7NaN4644
mean446.0000000.3838382.308642NaNNaN29.6991180.5230080.381594NaN32.204208NaNNaN
std257.3538420.4865920.836071NaNNaN14.5264971.1027430.806057NaN49.693429NaNNaN
min1.0000000.0000001.000000NaNNaN0.4200000.0000000.000000NaN0.000000NaNNaN
25%223.5000000.0000002.000000NaNNaN20.1250000.0000000.000000NaN7.910400NaNNaN
50%446.0000000.0000003.000000NaNNaN28.0000000.0000000.000000NaN14.454200NaNNaN
75%668.5000001.0000003.000000 NaNNaN38.0000001.0000000.000000NaN31.000000NaNNaN
max891.0000001.0000003.000000NaNNaN80.0000008.0000006.000000NaN512.329200NaNN

3 检查null值

data1 = data_raw.copy(deep=True)

data1.isnull().sum()

结果:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。

4 补全空值

data1['Age'].fillna(data1['Age'].median(), inplace = True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)

data1.isnull().sum()

补全操作check:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64

5 特征工程

5.1 干掉 3 列:

drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

5.2 增加 3 列

增加一列 FamilySize

data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1
data1

打印结果:


SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySize
003Braund, Mr. Owen Harrismale22.0107.2500S2
111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C2
213Heikkinen, Miss. Lainafemale26.0007.9250S1
311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S2
403Allen, Mr. William Henrymale35.0008.0500S1
.................................
88602Montvila, Rev. Juozasmale27.00013.0000S1
88711Graham, Miss. Margaret Edithfemale19.00030.0000S1
88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S4
88911Behr, Mr. Karl Howellmale26.00030.0000C1
89003Dooley, Mr. Patrickmale32.0007.7500Q1
891 rows × 10 columns

再创建一列:

data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)

再创建一列:

data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1

结果:

SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySizeIsAloneTitle
003Braund, Mr. Owen Harrismale22.0107.2500S20Mr
111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C20Mrs
213Heikkinen, Miss. Lainafemale26.0007.9250S11Miss
311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S20Mrs
403Allen, Mr. William Henrymale35.0008.0500S11Mr
.......................................
88602Montvila, Rev. Juozasmale27.00013.0000S11Rev
88711Graham, Miss. Margaret Edithfemale19.00030.0000S11Miss
88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S40Miss
88911Behr, Mr. Karl Howellmale26.00030.0000C11Mr
89003Dooley, Mr. Patrickmale32.0007.7500Q11Mr
891 rows × 12 columns

5.3 分箱走起

data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1

结果:

SurvivedPclassNameSexAgeSibSpParchFareEmbarkedFamilySizeIsAloneTitleFareCutAgeCut
003Braund, Mr. Owen Harrismale22.0107.2500S20Mr(-0.001, 7.91](13.333, 26.667]
111Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C20Mrs(31.0, 512.329](26.667, 40.0]
213Heikkinen, Miss. Lainafemale26.0007.9250S11Miss(7.91, 14.454](13.333, 26.667]
311Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S20Mrs(31.0, 512.329](26.667, 40.0]
403Allen, Mr. William Henrymale35.0008.0500S11Mr(7.91, 14.454](26.667, 40.0]
.............................................
88602Montvila, Rev. Juozasmale27.00013.0000S11Rev(7.91, 14.454](26.667, 40.0]
88711Graham, Miss. Margaret Edithfemale19.00030.0000S11Miss(14.454, 31.0](13.333, 26.667]
88803Johnston, Miss. Catherine Helen "Carrie"female28.01223.4500S40Miss(14.454, 31.0](26.667, 40.0]
88911Behr, Mr. Karl Howellmale26.00






请到「今天看啥」查看全文