Column: Machine Learning Research Society
The Machine Learning Research Society is a student organization under the Big Data and Machine Learning Innovation Center at Peking University, aiming to build a platform where machine-learning practitioners can exchange ideas. Besides sharing timely news from the field, the society hosts talks by industry and academic leaders, salon-style sharing sessions with senior researchers, real-data innovation competitions, and other events.

[Competition Notes] Template code and experience for mainstream machine learning models [xgb, lgb, Keras, LR]

Machine Learning Research Society · WeChat Official Account · AI · 2017-12-15 22:35

Main text

I've been entering a lot of competitions lately, so here are some general model templates that work with only minor tweaks.

A comprehensive XGBoost tuning guide: http://blog.csdn.net/han_xiaoyang/article/details/52665396
XGBoost official API:
http://xgboost.readthedocs.io/en/latest//python/python_api.html

Preprocess


# A general preprocessing scaffold

import pandas as pd
import numpy as np
import scipy as sp

# file loading
def read_csv_file(f, logging=False):
    print("========== loading data ==========")
    data = pd.read_csv(f)
    if logging:
        print(data.head(5))
        print(f, "contains the following columns")
        print(data.columns.values)
        print(data.describe())
        data.info()  # info() prints directly and returns None, so don't wrap it in print()
    return data
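Since pd.read_csv accepts any file-like object, the same pattern can be checked without touching disk; a StringIO stands in for a file (the column names here are made up):

```python
# Quick check of the loading pattern: read a tiny in-memory CSV and
# inspect it the same way the helper above does.
import io
import pandas as pd

csv_text = "uid,price\n1,3.5\n2,7.0\n"
data = pd.read_csv(io.StringIO(csv_text))
print(data.columns.values)  # column names
print(data.describe())      # per-column summary statistics
```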



Logistic Regression


# A general LogisticRegression scaffold

import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1. load data (placeholders -- substitute your own loading code)
df_train = pd.DataFrame()
df_test  = pd.DataFrame()
y_train = df_train['label'].values

# 2. process data
ss = StandardScaler()

# 3. feature engineering/encoding
# 3.1 for categorical features: fit the encoder on train, then only transform
#     test, so both sides share one column layout; handle_unknown='ignore'
#     maps categories unseen at fit time to all-zero rows
enc = OneHotEncoder(handle_unknown='ignore')
feats = ["creativeID", "adID", "campaignID"]
for i, feat in enumerate(feats):
    x_train = enc.fit_transform(df_train[feat].values.reshape(-1, 1))
    x_test = enc.transform(df_test[feat].values.reshape(-1, 1))
    if i == 0:
        X_train, X_test = x_train, x_test
    else:
        X_train = sparse.hstack((X_train, x_train))
        X_test = sparse.hstack((X_test, x_test))

# 3.2 for numerical features
# StandardScaler expects 2-D input; indexing with a list of columns keeps it
# 2-D, otherwise reshape(-1, len(feats)) is required
feats = ["price", "age"]
x_train = ss.fit_transform(df_train[feats].values)
x_test = ss.transform(df_test[feats].values)  # reuse train statistics; don't refit on test
X_train, X_test = sparse.hstack((X_train, x_train)), sparse.hstack((X_test, x_test))

# 4. model training
lr = LogisticRegression()
lr.fit(X_train, y_train)
proba_test = lr.predict_proba(X_test)[:, 1]
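Why the encoder is fitted on train and only applied to test: refitting on test can learn a different set of one-hot columns, silently misaligning the feature matrices. A toy demonstration:

```python
# Fit the encoder on train only; transform test with the same fitted encoder.
# handle_unknown='ignore' turns a category unseen at fit time into an
# all-zero row instead of raising an error.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_ids = np.array([[1], [2], [3]])
test_ids = np.array([[2], [9]])  # id 9 never appears in train

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_ids)
encoded = enc.transform(test_ids).toarray()
print(encoded)  # [[0. 1. 0.], [0. 0. 0.]] -- unseen id 9 becomes all zeros
```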




LightGBM


1. Binary classification


import lightgbm as lgb
import pandas as pd
import numpy as np
import pickle
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


print("Loading Data ... ")

# load data (load_data() is a user-supplied helper)
train_x, train_y, test_x = load_data()

# hold out a validation set with sklearn.model_selection.train_test_split;
# test_size=0.05 keeps 5% for validation -- adjust the ratio as needed
X, val_X, y, val_y = train_test_split(
    train_x,
    train_y,
    test_size=0.05,
    random_state=1,
    stratify=train_y  # keep each split's label distribution consistent with the original data
)
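What stratify buys you: both splits keep the label ratio of the original data, which matters for imbalanced competition datasets. A toy check with an 80/20 class imbalance:

```python
# stratify=y preserves the class ratio in both the train and validation split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(y_tr.mean(), y_val.mean())  # both 0.2: the 80/20 ratio is preserved
```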


X_train = X
y_train = y
X_test = val_X
y_test = val_y

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss', 'auc'},
    'num_leaves': 5,
    'max_depth': 6,
    'min_data_in_leaf': 450,
}






