About XGBoost (Overview and Implementation)

January 22, 2021
August 8, 2021
Machine Learning
XGBoost

This post summarizes what XGBoost is (leaving out most of the theory), how to use it, and its hyperparameters and how to tune them.

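XGBoost at a glance

A regression and classification method (popular and widely used in data analysis competitions).
A decision-tree-based gradient boosting method (Gradient Boosting Decision Tree, GBDT).
It uses ensemble learning (combining the outputs of multiple models to produce the final output).
(Like a random forest, XGBoost uses many decision trees to make its output robust. Unlike a random forest, the trees are built one after another and their outputs are summed; the parallelization happens inside the construction of each tree.)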

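What is gradient boosting?

Boosting: a method that advances learning by building each new model from information about the models trained so far.
(Each new model is built to improve on the samples that were misclassified in classification, or that had large errors in regression.)
Gradient: in boosting, the new model is built using the gradient of the objective (loss) function, evaluated at the predictions of the model trained so far.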
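
The idea above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions (squared-error loss, one-dimensional inputs, depth-1 trees, made-up data), not how XGBoost itself is implemented; all function names are hypothetical.

```python
# Gradient boosting for squared-error loss with depth-1 trees (stumps).
# Pure-Python illustration on 1-D inputs; not the XGBoost implementation.

def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the negative gradient of the loss."""
    base = sum(ys) / len(ys)  # initial constant model
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For L = 1/2 * (y - f)^2 the negative gradient is the residual y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosting(model, x):
    total = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * total


def mse(model, xs, ys):
    return sum((y - predict_boosting(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data with three rough plateaus.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_round = fit_boosting(xs, ys, n_rounds=1)
many_rounds = fit_boosting(xs, ys, n_rounds=50)
print("training MSE after 1 round  :", mse(one_round, xs, ys))
print("training MSE after 50 rounds:", mse(many_rounds, xs, ys))
```

Each round only corrects part of the remaining error (scaled by the learning rate), so the training error shrinks as rounds are added.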

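Features of XGBoost

Handles missing values.
To prevent overfitting, it uses several techniques in addition to regularization (shrinkage, feature subsampling, and data subsampling).
Shrinkage: the contribution of each newly added tree is scaled down by a factor (the learning rate, eta), which leaves room for later trees to improve the model.
Subsampling: during boosting, each decision tree is built not from all features and data, but from features and data picked at random in a specified proportion.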
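
As a rough sketch of the subsampling described above, the following picks the rows and columns that a single tree would see. The helper name and the rounding rule are assumptions made for illustration; XGBoost's internal sampling differs in detail.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices one tree would be trained on."""
    n_pick_rows = max(1, int(n_rows * subsample))
    n_pick_cols = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_pick_rows))
    cols = sorted(rng.sample(range(n_cols), n_pick_cols))
    return rows, cols


rng = random.Random(0)  # fixed seed so the run is repeatable
for tree_index in range(3):
    rows, cols = sample_rows_and_columns(
        n_rows=10, n_cols=4, subsample=0.8, colsample_bytree=0.5, rng=rng)
    print("tree", tree_index, "rows", rows, "columns", cols)
```

Because every tree sees a different random subset, no single tree can fit all of the noise in the training data.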

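XGBoost parameters

(Reference)

Parameters to set first

General Parameters (parameters for XGBoost as a whole)

Parameter	Description	Values
booster	Type of model to run	gbtree: tree model (default); gblinear: linear model
silent	Whether to print run-time messages	0: print messages (default); 1: do not print. Deprecated in recent XGBoost versions in favor of verbosity.
nthread	Number of threads used for parallel processing	Default: not set (all available cores are used automatically)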

Command Line Parameters

data: path to the training data
num_round: number of boosting rounds (called nrounds in the R package)

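Learning Task Parameters

Parameter	Description	Values
objective	Loss function to minimize	reg:squarederror: squared-error regression (default; named reg:linear in older versions); reg:logistic: logistic regression; binary:logistic: binary classification, returns probabilities; multi:softmax: multiclass classification, returns class labels
eval_metric	Evaluation metric for validation data	Default: determined by the objective parameter; mae: mean absolute error; rmse: root mean squared error; logloss: negative log-likelihood; error: binary classification error rate; merror: multiclass classification error rate; mlogloss: multiclass log loss; auc: area under the ROC curve
seed	Random seed	Default: 0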
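
To show how the general and learning task parameters fit together, here is a minimal sketch of a parameter dictionary in the style of the XGBoost native API. The concrete values (4 threads, 100 rounds) and the `dtrain` name are assumptions for illustration, and the training call is left as a comment so the snippet runs without any data.

```python
# Parameter dictionary in the style of the XGBoost native API.
params = {
    # General parameters
    "booster": "gbtree",
    "nthread": 4,  # assumed value; leave unset to use all cores
    # Learning task parameters
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "seed": 0,
}
num_boost_round = 100  # assumed number of boosting rounds

# With xgboost installed and a DMatrix named dtrain (hypothetical),
# training would look roughly like this:
#   import xgboost as xgb
#   booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)

for name, value in params.items():
    print(f"{name} = {value}")
```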

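Parameters to tune in detail:
Booster Parameters

Frequently tuned

Parameter	Description	Values
eta	Learning rate (shrinkage applied to each new tree)	Range: 0 to 1; default: 0.3
max_depth	Maximum depth of a decision tree	Range: 0 and up; default: 6
min_child_weight	Minimum sum of instance weights (Hessian) required in a child node	Range: 0 and up; default: 1
gamma	Minimum loss reduction required to split a leaf further	Range: 0 and up; default: 0
subsample	Fraction of samples drawn at random for each tree	Range: 0 to 1 (excluding 0); default: 1
colsample_bytree	Fraction of columns (features) drawn at random for each tree	Range: 0 to 1 (excluding 0); default: 1
lambda	L2 regularization term on the weights	Default: 1
alpha	L1 regularization term on the weights	Default: 0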
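
How gamma, lambda, and min_child_weight restrain tree growth can be seen from the split gain used in gradient-boosted trees. The sketch below follows the standard formula with L2 regularization only (alpha is assumed to be 0); the function names and the toy gradient sums are made up, and the real implementation handles more cases.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Quality term G^2 / (H + lambda) of a leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into two children."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and gain is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Toy gradient/Hessian sums. For squared-error loss each sample has
# Hessian 1, so h_left=4 means "4 samples in the left child".
stats = dict(g_left=-4.0, h_left=4.0, g_right=3.0, h_right=3.0)

print("gain, defaults   :", split_gain(**stats))
print("gain, lambda=10  :", split_gain(**stats, reg_lambda=10.0))
print("allowed, gamma=3 :", split_allowed(**stats, gamma=3.0))
print("allowed, mcw=4   :", split_allowed(**stats, min_child_weight=4.0))
```

A larger lambda shrinks the gain, a larger gamma raises the bar a split must clear, and a larger min_child_weight rejects splits that create small children, which is why increasing any of them makes the model more conservative.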

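Rarely tuned

Parameter	Description	Values / notes
max_leaf_nodes	Maximum number of terminal nodes (leaves) in a tree; used in place of max_depth	XGBoost itself names this parameter max_leaves
max_delta_step	Constrains the estimated weight of each tree's leaves (usually not needed, but used for classification on imbalanced data)	Range: 0 and up; default: 0 (no constraint)
colsample_bylevel	Subsample ratio of columns (features) for the splits at each depth level	Range: 0 to 1 (excluding 0); default: 1. Rarely used, since subsample and colsample_bytree are usually sufficient
scale_pos_weight	Weight applied to the positive class; on imbalanced data, setting a value greater than 0 can speed up convergence	Default: 1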
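
For scale_pos_weight, a commonly suggested starting value is the ratio of negative to positive samples. The helper below is a hypothetical sketch of that calculation for 0/1 labels, using made-up data.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive samples for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("labels contain no positive samples")
    return n_neg / n_pos


# Made-up imbalanced data: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(labels))  # 9.0
```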

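Summary of parameters and model behavior

Frequently tuned

Conservative model		Prone to overfitting
small >>>	eta	>>> large
small >>>	max_depth	>>> large
large <<<	min_child_weight	<<< small
large <<<	gamma	<<< small
small >>>	subsample	>>> large
small >>>	colsample_bytree	>>> large
large <<<	lambda	<<< small
large <<<	alpha	<<< small
small >>>	max_leaf_nodes	>>> large
small positive >>>	max_delta_step	>>> large, or 0 (no constraint)
small >>>	colsample_bylevel	>>> large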
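
The parameter names in the tables above are those of the native API. The scikit-learn wrapper classes (XGBRegressor, XGBClassifier) use different names for a few of them, so a small translation step can help. The sketch below is an illustration only: the mapping covers just the names that differ, and the "conservative" values are made up, not recommendations.

```python
# Native parameter names that are spelled differently in the
# scikit-learn wrapper (XGBRegressor / XGBClassifier).
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
    "nthread": "n_jobs",
    "seed": "random_state",
}


def to_sklearn_kwargs(native_params):
    """Rename native-style keys; names shared by both APIs pass through."""
    return {NATIVE_TO_SKLEARN.get(name, name): value
            for name, value in native_params.items()}


# A deliberately conservative setting (illustrative values only).
conservative = {
    "eta": 0.05,
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 1.0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 5.0,
    "alpha": 1.0,
}

print(to_sklearn_kwargs(conservative))
```

The resulting dictionary could then be unpacked into a wrapper constructor, for example `xgb.XGBRegressor(**kwargs)`; the number of boosting rounds is set there through n_estimators.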

Code (without hyperparameter optimization)

Hyperparameter optimization methods are covered in a separate article.

Installing XGBoost

pip install xgboost

Regression

This example uses the Boston housing price data (a scikit-learn sample dataset).

import numpy as np
import xgboost as xgb
# Note: load_boston was removed in scikit-learn 1.2; this example requires an older version.
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
boston = load_boston()
df_X = pd.DataFrame(boston.data, columns=boston.feature_names)
df_y = pd.Series(boston.target)
print('Number of samples:', df_X.shape[0])
# output >>> Number of samples: 506
print('Number of features:', df_X.shape[1])
# output >>> Number of features: 13

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, shuffle=True)

# Create the XGBoost model
model = xgb.XGBRegressor(max_depth=3)

# Train
model.fit(X_train, y_train)

# Evaluate the trained model (compute RMSE)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
print('RMSE(train data):', round(np.sqrt(mean_squared_error(y_train, y_pred_train)), 3))
# output >>> RMSE(train data): 0.562
print('RMSE(test data):', round(np.sqrt(mean_squared_error(y_test, y_pred_test)), 3))
# output >>> RMSE(test data): 3.3

# Plot feature importance
xgb.plot_importance(model)
plt.show()

# To get the importances as numbers
importances = pd.Series(model.feature_importances_, index=df_X.columns)
importances = importances.sort_values()
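
For reference, the RMSE computed above with NumPy and scikit-learn is just the square root of the mean squared difference. A minimal pure-Python sketch with made-up values:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions (errors: -1, 1, 1, -2).
y_true = [24.0, 22.0, 35.0, 33.0]
y_pred = [25.0, 21.0, 34.0, 35.0]
print(round(rmse(y_true, y_pred), 3))  # 1.323
```

A training RMSE far below the test RMSE, as in the run above, is a sign that the model fits the training data much more closely than unseen data.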

Binary classification

This example uses the breast cancer diagnosis data (a scikit-learn sample dataset).

import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Load the data
breast_cancer = load_breast_cancer()
df_X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df_y = pd.Series(breast_cancer.target)
print('Number of samples:', df_X.shape[0])
# output >>> Number of samples: 569
print('Number of features:', df_X.shape[1])
# output >>> Number of features: 30

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, shuffle=True)

# Create the XGBoost model
model = xgb.XGBClassifier(max_depth=3)

# Train
model.fit(X_train, y_train)

# Evaluate the trained model
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)
print(confusion_matrix(y_test, y_pred_test))
"""output
[[69  2]
 [ 5 95]]
"""
print(classification_report(y_test, y_pred_test))
"""output
              precision    recall  f1-score   support
           0       0.93      0.97      0.95        71
           1       0.98      0.95      0.96       100
    accuracy                           0.96       171
   macro avg       0.96      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171
"""

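The numbers in the classification report for the binary example can be recomputed by hand from the confusion matrix (rows are true classes, columns are predicted classes). The helper below is a hypothetical sketch that reuses the matrix printed above.

```python
# Confusion matrix from the binary classification run above.
# Rows: true class, columns: predicted class.
confusion = [[69, 2],
             [5, 95]]


def class_metrics(matrix, k):
    """Precision, recall and F1 score for class k."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in range(len(confusion)):
    precision, recall, f1 = class_metrics(confusion, k)
    print(k, round(precision, 2), round(recall, 2), round(f1, 2))

correct = sum(confusion[k][k] for k in range(len(confusion)))
total = sum(sum(row) for row in confusion)
print("accuracy", round(correct / total, 2))  # 0.96
```
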
Multiclass classification

This example uses the Iris data (a scikit-learn sample dataset).

import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Prepare the data
iris = load_iris()
df_X = pd.DataFrame(iris.data, columns=iris.feature_names)
df_y = pd.Series(iris.target)
print('Number of samples:', df_X.shape[0])
# output >>> Number of samples: 150
print('Number of features:', df_X.shape[1])
# output >>> Number of features: 4

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, shuffle=True)

# Create the XGBoost model
model = xgb.XGBClassifier(max_depth=3)

# Train
model.fit(X_train, y_train)

# Predict on the test data
y_pred_test = model.predict(X_test)

# Compute the scores
print(confusion_matrix(y_test, y_pred_test))
"""output
[[15  0  0]
 [ 0 14  2]
 [ 0  0 14]]
"""
print(classification_report(y_test, y_pred_test))
"""output
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        15
           1       1.00      0.88      0.93        16
           2       0.88      1.00      0.93        14
    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45
"""

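The multi:softmax objective listed in the parameter table returns the class with the highest softmax probability, while the related multi:softprob objective returns the per-class probabilities themselves. A minimal pure-Python sketch of that final step, using made-up raw scores for a 3-class problem:

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [0.2, 2.1, -0.5]  # made-up scores for classes 0, 1, 2
probabilities = softmax(raw_scores)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print("probabilities:", [round(p, 3) for p in probabilities])
print("predicted class:", predicted_class)  # 1
```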