Column: 挖地兔

Financial data collection and mining: the first door into quantitative finance.

Essential Machine Learning Skills: "Data Preprocessing"

挖地兔 · WeChat official account · 2019-08-06 05:08

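Tushare Finance and Technology Study Group

Compiled and edited by | 一只小绿怪兽

About the translator: a first-year master's student in International Business at Beijing International Studies University, currently learning Python programming and quantitative investing.

Author: Datacamp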

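Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so that performance keeps improving.

It is the core of artificial intelligence and the fundamental route to making computers intelligent. Its applications span every area of AI, and it relies mainly on induction and synthesis rather than deduction.

The official definition above looks intimidating. Put simply, the core of machine learning is letting a machine learn from existing data through algorithms and then make predictions on new data.

Before a machine learning model can make predictions, a very important step is preprocessing the input data, which is the subject of this article. It mainly covers standardizing the data, creating the most representative features, and selecting the features best suited to the model.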

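[Tools] Python 3

[Data] tushare.pro, Datacamp

[Note] This article focuses on explaining the methods; please adapt them to your own data.

Data Cleaning

Before any analysis, the first step is always to clean the data. Common cleaning methods were covered in the article 《这些方法解决了数据清洗80%的工作量》 ("These methods handle 80% of the data-cleaning workload"), so this section is only a brief recap.

① Drop the rows that have a missing value in a given column.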

print(volunteer.head())

   opportunity_id  content_id  vol_requests  event_time ... Census Tract  BIN BBL NTA
0            4996       37004            50           0 ...          NaN  NaN NaN NaN
1            5008       37036             2           0 ...          NaN  NaN NaN NaN
2            5016       37143            20           0 ...          NaN  NaN NaN NaN
3            5022       37237           500           0 ...          NaN  NaN NaN NaN
4            5055       37425            15           0 ...          NaN  NaN NaN NaN

[5 rows x 35 columns]


print(volunteer.shape)
(665, 35)

# Count the missing values in the 'category_desc' column
print(volunteer['category_desc'].isnull().sum())
# Keep only the rows where 'category_desc' is not null
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]
print(volunteer_subset.shape)

48
(617, 35)
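
The same filter can also be written with `dropna(subset=...)`. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the volunteer dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the volunteer data; values are made up for illustration
toy = pd.DataFrame({
    "opportunity_id": [1, 2, 3],
    "category_desc": ["Education", np.nan, "Health"],
})

# Boolean-mask version, as used above
by_mask = toy[toy["category_desc"].notnull()]

# Equivalent: drop rows whose 'category_desc' is missing
by_dropna = toy.dropna(subset=["category_desc"])

print(by_dropna.shape)            # (2, 2)
print(by_mask.equals(by_dropna))  # True
```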

② Use .astype() to convert a column's data type.

print(volunteer.dtypes) 

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                   object

# Inspect the first 5 rows of the hits column
print(volunteer["hits"].head())

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: object

# Convert the hits column to integer type
volunteer["hits"] = volunteer["hits"].astype('int')

# Check the data types again
print(volunteer.dtypes)

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
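
`.astype('int')` raises a `ValueError` as soon as one value cannot be parsed. When a column may contain stray text, `pd.to_numeric` with `errors='coerce'` is a more forgiving option. A small sketch, assuming a made-up `hits` column with one bad entry:

```python
import pandas as pd

# Made-up string column with one unparseable entry
hits = pd.Series(["737", "22", "n/a", "14"], name="hits")

# Unparseable values become NaN instead of raising an error
hits_num = pd.to_numeric(hits, errors="coerce")
print(hits_num.isnull().sum())   # 1

# Drop the bad row, then the remaining values convert cleanly
hits_int = hits_num.dropna().astype("int64")
print(hits_int.tolist())         # [737, 22, 14]
```

Whether dropping or imputing the bad rows is appropriate depends on the dataset; dropping is used here only to keep the sketch short.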

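③ Split the data into a training set and a test set.

We split the data into two parts: the training set is used to build the model, and the test set is used to evaluate its predictive ability. Evaluating on data the model has never seen is what lets us detect overfitting.

sklearn (scikit-learn) is a widely used third-party Python module for machine learning, and we can call its train_test_split [1] function directly to split a dataset. By default, 75% of the data goes to the training set and 25% to the test set, as in the example below.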

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(8).reshape((4, 2)), range(4)
print(X)
print(list(y))

[[0 1]
 [2 3]
 [4 5]
 [6 7]]
[0, 1, 2, 3]

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train,  X_test, y_train, y_test)

[[4 5]
 [6 7]
 [2 3]] [[0 1]] [2, 3, 1] [0]
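
Because the split is random, the output above changes from run to run. Two commonly used parameters of `train_test_split` are `test_size`, which sets the split ratio, and `random_state`, which makes the shuffle reproducible. A brief sketch (the seed 42 and the 10-row array are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = list(range(10))

# Hold out 20% of the rows; fixing random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)

# The same seed gives the same split again
_, _, _, y_test_again = train_test_split(X, y, test_size=0.2, random_state=42)
print(y_test == y_test_again)        # True
```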

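In most cases the default parameters of train_test_split work fine. But if the classes in the dataset are unevenly distributed, the resulting training and test sets may not be representative, which can bias the model's predictions.

In that case stratified sampling is the better choice, and it can be enabled with the stratify parameter.

In the example below, the class column holds 100 samples, 80 of class1 and 20 of class2. With stratified sampling we want the following split:

[Training set] 75 samples: 60 class1, 15 class2

[Test set] 25 samples: 20 class1, 5 class2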

print(df['class'].value_counts())

class1    80
class2    20
Name: class, dtype: int64

X = df[['number']]
y = df[['class']]

# Set the stratify parameter
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
print(y_train['class'].value_counts())
print(y_test['class'].value_counts())

class1    60
class2    15
Name: class, dtype: int64
class1    20
class2     5
Name: class, dtype: int64

# Without stratify, the result looks like this
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(y_train['class'].value_counts())
print(y_test['class'].value_counts())

class1    57
class2    18
Name: class, dtype: int64
class1    23
class2     2
Name: class, dtype: int64

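Data Standardization

Standardization is needed when the model operates in a linear space, when a feature in the dataset has very high variance, when the features are continuous but on different scales, or when the model makes linearity assumptions.

This article covers two methods: log normalization and feature scaling.

① Log normalization

If one feature column has very high variance, it can be transformed with a log function. In Python, log() (for example np.log) computes the natural logarithm, base e, by default.

In the example below, col2 has a very high variance before log() is applied; after the transformation, the gap between the two columns' variances shrinks markedly.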

   col1   col2
0  1.00    3.0
1  1.20   45.5
2  0.75   28.0
3  1.60  100.0

print(df.var())
col1       0.128958
col2    1691.729167
dtype: float64

import numpy as np
df['log_2'] = np.log(df['col2'])
print(df)

   col1   col2     log_2
0  1.00    3.0  1.098612
1  1.20   45.5  3.817712
2  0.75   28.0  3.332205
3  1.60  100.0  4.605170

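# Note: np.var uses the population variance (ddof=0), whereas df.var() above uses the sample variance (ddof=1)
print(np.var(df[['col1', 'log_2']]))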
col1     0.096719
log_2    1.697165
dtype: float64
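
To compare the variances on an equal footing, the same estimator should be used before and after the transform. The sketch below rebuilds the four-row frame shown above and uses `DataFrame.var()` (sample variance, `ddof=1`) throughout:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [3.0, 45.5, 28.0, 100.0],
})

# Natural log compresses large values far more than small ones
df["log_2"] = np.log(df["col2"])

variances = df.var()  # ddof=1 for every column
print(variances.round(4))
```

With `ddof=1` the variance of `log_2` comes out near 2.26, still far below the roughly 1692 of `col2`. Keep in mind that the log is only defined for positive values, so columns containing zeros or negatives need different handling.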

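② Feature scaling

When features are on different scales and a linear model is used, feature scaling can standardize them. The goal is to center each feature at mean 0 with variance 1 (this rescales the data but does not make it normally distributed), which can be done with the StandardScaler class in sklearn.

In the example below, the values within each column of df are fairly close to one another, but the magnitudes differ markedly from column to column, so the data needs standardizing.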

   col1  col2   col3
0  1.00  48.0  100.0
1  1.20  45.5  101.3
2  0.75  46.2  103.5
3  1.60  50.0  104.0

print(df.var())

col1    0.128958
col2    4.055833
col3    3.526667
dtype: float64


import pandas as pd
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df), columns=df.columns)

print(df_scaled)

       col1      col2      col3
0 -0.442127  0.329683 -1.352726
1  0.200967 -1.103723 -0.553388
2 -1.245995 -0.702369  0.799338
3  1.487156  1.476409  1.106776

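# StandardScaler divides by the population std (ddof=0); DataFrame.var() uses ddof=1, hence 4/3 for 4 rows
print(df_scaled.var())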

col1    1.333333
col2    1.333333
col3    1.333333
dtype: float64
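
Under the hood, the scaling is just `(x - mean) / std` per column, with the population standard deviation (`ddof=0`). A short sketch on the same four-row frame that checks the manual computation against `StandardScaler`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [48.0, 45.5, 46.2, 50.0],
    "col3": [100.0, 101.3, 103.5, 104.0],
})

# Manual z-score: subtract the column mean, divide by the population std
manual = (df - df.mean()) / df.std(ddof=0)

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(np.allclose(manual, scaled))   # True
print(scaled.var(ddof=0).round(6))   # 1.0 for every column
```

In a real pipeline the scaler would normally be fitted on the training set only and then applied to the test set with `transform`, so that test-set statistics do not leak into training.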

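Feature Engineering

Feature engineering is the process of creating new features from the original ones so that the features predict unseen data more accurately. It requires a deep understanding of the dataset.

For example, if the goal is to assess how a class is doing overall, each student's individual exam score is of little use on its own; the class average is a better choice.

Different datasets and models call for different feature engineering methods; this article introduces only a few for reference.

Scenario 1: categorical features

Categorical variables are usually text and must be converted to numbers before being fed to a model. This can be done with either Pandas or sklearn.

① Use the LabelEncoder class from sklearn.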

hiking[["Accessible"]].head()

  Accessible
0          Y
1          N
2          N
3          N
4          N

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

print(hiking[["Accessible_enc", "Accessible"]].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
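
The fitted encoder remembers which label maps to which integer, so the mapping can be inspected and reversed. A small sketch on a made-up `Accessible` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

accessible = pd.Series(["Y", "N", "N", "Y", "N"], name="Accessible")

enc = LabelEncoder()
codes = enc.fit_transform(accessible)

# classes_ lists the labels in sorted order; the position is the code
print(enc.classes_.tolist())                   # ['N', 'Y']
print(codes.tolist())                          # [1, 0, 0, 1, 0]
# inverse_transform maps the integers back to the original labels
print(enc.inverse_transform(codes).tolist())   # ['Y', 'N', 'N', 'Y', 'N']
```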

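② Use the get_dummies() [2] function from Pandas. Here we load industry data from tushare.pro as a demonstration.

import pandas as pd
import tushare as ts

# Assumes a tushare token has already been configured, e.g. with ts.set_token()
pro = ts.pro_api()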

df = pro.stock_basic(exchange='', list_status='L', fields='ts_code, industry')
df = df.head()
print(df)

     ts_code industry
0  000001.SZ       银行
1  000002.SZ     全国地产
2  000004.SZ     生物制药
3  000005.SZ     环境保护
4  000006.SZ     区域地产

df_enc = pd.get_dummies(df['industry'])
df_enc.index = df['ts_code']
print(df_enc)

           全国地产  区域地产  环境保护  生物制药  银行
ts_code                              
000001.SZ     0     0     0     0   1
000002.SZ     1     0     0     0   0
000004.SZ     0     0     0     1   0
000005.SZ     0     0     1     0   0
000006.SZ     0     1     0     0   0
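
The industry example needs a tushare token, so here is an offline sketch of the same one-hot encoding on a made-up `sector` column. It also shows two optional `get_dummies` parameters: `prefix`, which labels the new columns, and `drop_first`, which drops one redundant column:

```python
import pandas as pd

# Made-up categorical data for illustration
df = pd.DataFrame({
    "code": ["A", "B", "C", "D"],
    "sector": ["bank", "tech", "bank", "energy"],
})

# One 0/1 column per category; dtype=int keeps the output as integers
one_hot = pd.get_dummies(df["sector"], prefix="sector", dtype=int)
print(one_hot.columns.tolist())
# ['sector_bank', 'sector_energy', 'sector_tech']

# drop_first=True removes the first category, which the others imply
reduced = pd.get_dummies(df["sector"], prefix="sector", drop_first=True, dtype=int)
print(reduced.columns.tolist())
# ['sector_energy', 'sector_tech']
```

Dropping one column is mainly useful for linear models, where a full set of dummy columns is perfectly collinear with the intercept.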

Scenario 2: numeric features

① Take the mean

print(running_times_5k)

      name  run1  run2  run3  run4  run5   
0      Sue  20.1  18.5  19.6  20.3  18.3
1     Mark  16.5  17.1  16.9  17.6  17.3  
2     Sean  23.5  25.1  25.2  24.6  23.9 
3     Erin  21.7  21.1  20.9  22.1  22.2  
4    Jenny  25.8  27.1  26.1  26.7  26.9  
5  Russell  30.9  29.6  31.4  30.4  29.9  


run_columns = ["run1", "run2", "run3", "run4", "run5"]
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

print(running_times_5k)

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44
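
The row-wise `apply` works, but pandas can compute the same thing in one vectorized call with `mean(axis=1)`. A sketch on the first two runners from the table above:

```python
import pandas as pd

running_times_5k = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# axis=1 averages across the selected columns for each row
running_times_5k["mean"] = running_times_5k[run_columns].mean(axis=1)
print(running_times_5k["mean"].round(2).tolist())  # [19.36, 17.08]
```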

② Extract the month from a date

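import pandas as pd
import tushare as ts

# Assumes a tushare token has already been configured, e.g. with ts.set_token()
pro = ts.pro_api()

df = pro.daily(ts_code='000001.SZ', start_date='20180701', end_date='20180706')[['ts_code', 'trade_date', 'close']]
df.sort_values('trade_date', inplace=True)                              # sort in ascending order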
df['trade_date'] = pd.to_datetime(df['trade_date'
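
Extracting the month comes down to parsing the date strings and reading the `.dt.month` accessor. A minimal offline sketch, assuming `trade_date` arrives as `YYYYMMDD` strings like the `start_date` argument above; the closing prices are made-up placeholders, not real quotes:

```python
import pandas as pd

df = pd.DataFrame({
    "ts_code": ["000001.SZ"] * 3,
    "trade_date": ["20180702", "20180703", "20180801"],
    "close": [10.0, 10.1, 10.2],   # placeholder values, not real prices
})

# Parse the YYYYMMDD strings into datetime64 values
df["trade_date"] = pd.to_datetime(df["trade_date"], format="%Y%m%d")

# The .dt accessor exposes calendar fields such as the month
df["month"] = df["trade_date"].dt.month
print(df["month"].tolist())   # [7, 7, 8]
```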
