Machine Learning: Decision Trees

Feature Selection

1. Information Gain

The larger the value of Gain(D, a), the greater the gain in purity obtained by splitting the sample set D on attribute a. The attribute with the highest information gain is therefore chosen as the split attribute.
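As a minimal sketch of this criterion, the gain can be computed from the definition Gain(D, a) = Ent(D) − Σ_v (|D^v|/|D|) · Ent(D^v), where the D^v are the subsets produced by the split. The helper names below are illustrative, not from the original code.

```python
import numpy as np

def entropy(y):
    # Shannon entropy Ent(D) = -sum(p_k * log2(p_k)) over class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, groups):
    # Gain(D, a) = Ent(D) - sum(|D_v|/|D| * Ent(D_v))
    n = len(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(y) - weighted

# Toy example: a binary attribute that separates the classes perfectly,
# so the gain equals the parent entropy Ent(D) = 1.0
y = np.array([0, 0, 0, 1, 1, 1])
groups = [np.array([0, 0, 0]), np.array([1, 1, 1])]
gain = information_gain(y, groups)  # -> 1.0
```

A split that leaves both children with the same class mix as the parent would score a gain of 0, which is why maximizing this quantity favors purer partitions.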

2. Gini Index

The smaller the value of Gini_index(D, a), the greater the gain in purity obtained by splitting the sample set on the discrete attribute a. The attribute with the lowest Gini index is therefore chosen as the split attribute.
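A minimal sketch of this criterion, following the standard definitions Gini(D) = 1 − Σ_k p_k² and Gini_index(D, a) = Σ_v (|D^v|/|D|) · Gini(D^v); the function names are illustrative assumptions, not part of the original code.

```python
import numpy as np

def gini(y):
    # Gini(D) = 1 - sum(p_k^2): probability two random samples differ in class
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(y, groups):
    # Gini_index(D, a): size-weighted Gini impurity of the child subsets
    n = len(y)
    return sum(len(g) / n * gini(g) for g in groups)

y = np.array([0, 0, 1, 1])
pure_split = [np.array([0, 0]), np.array([1, 1])]
mixed_split = [np.array([0, 1]), np.array([0, 1])]
best = gini_index(y, pure_split)    # -> 0.0 (pure children)
worst = gini_index(y, mixed_split)  # -> 0.5 (children as mixed as the parent)
```

Unlike information gain, which is maximized, the Gini index is minimized: 0 means every child subset is pure.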

3. Mean Squared Error

The smaller the value of MSE(D, a), the better the decision tree fits the sample set. For regression trees, the attribute that minimizes the MSE is therefore chosen as the split attribute.
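For regression splits, each candidate child predicts the mean of its targets, and the split is scored by the size-weighted MSE of the children. The sketch below illustrates this under that assumption; the names are hypothetical.

```python
import numpy as np

def split_mse(y, groups):
    # Size-weighted MSE of a regression split: each leaf predicts its own mean,
    # so a group's contribution is its variance around that mean.
    n = len(y)
    return sum(len(g) / n * np.mean((g - g.mean()) ** 2) for g in groups)

y = np.array([1.0, 1.0, 5.0, 5.0])
good = [np.array([1.0, 1.0]), np.array([5.0, 5.0])]
bad = [np.array([1.0, 5.0]), np.array([1.0, 5.0])]
good_score = split_mse(y, good)  # -> 0.0 (each leaf is constant)
bad_score = split_mse(y, bad)    # -> 4.0 (leaves as spread out as the parent)
```

This is the criterion scikit-learn's `DecisionTreeRegressor` uses by default (`criterion="squared_error"`), which is why it applies directly to the implementation below.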

Decision Tree Implementation

```python
from sklearn import tree

import numpy as np
import pandas as pd
import sklearn.model_selection

from utils.features import *

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)

features = darshan_features

def load_datasets():
    # df is assumed to be provided by utils.features; split it 80/20
    df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.2)

    X_train, X_test = df_train[features], df_test[features]
    y_train, y_test = df_train["value"], df_test["value"]

    return X_train, X_test, y_train, y_test

def model_train(X_train, X_test, y_train, y_test):
    # Decision tree regression
    clf = tree.DecisionTreeRegressor()
    # Fit the training data
    clf = clf.fit(X_train, y_train)

    y_pred_test = clf.predict(X_test)

    print(y_test)
    print(y_pred_test)
    # Median multiplicative prediction error (targets are on a log10 scale)
    error = np.median(10 ** np.abs(y_test - y_pred_test))
    print(error)

def main():
    X_train, X_test, y_train, y_test = load_datasets()
    model_train(X_train, X_test, y_train, y_test)

if __name__ == "__main__":
    main()
```
