Yuanchuang Essay Contest | Hands-On Machine Learning (7): Ensemble Learning

1 Voting Classifiers

2 Bagging and Pasting

3 Out-of-Bag Evaluation

4 Q&A

``````# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)``````

1 Voting Classifiers

``````from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Noisy two-moons toy dataset; the default split holds out 25% for testing
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)

# Hard voting: predict the class that receives the most votes
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

# Compare each individual classifier with the ensemble on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))``````

``````LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896``````

``````log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
# probability=True makes SVC expose predict_proba, which soft voting needs
svm_clf = SVC(gamma="auto", probability=True, random_state=42)

# Soft voting: average the predicted class probabilities, then pick the highest
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))``````

``````LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912``````

2 Bagging and Pasting

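In Scikit-Learn, the BaggingClassifier class handles both bagging and pasting (BaggingRegressor does the same for regression). The code below trains an ensemble of 500 decision tree classifiers, each on 100 training instances sampled at random from the training set with replacement. max_samples can also be a float between 0.0 and 1.0, in which case the number of instances drawn is the training set size multiplied by max_samples. Sampling with replacement makes this an example of bagging; to use pasting instead, set bootstrap=False. The n_jobs parameter tells Scikit-Learn how many CPU cores to use for training and prediction (-1 means all available cores).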

``````from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: 500 trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

# Single decision tree trained on the full training set, for comparison
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    # Predict on a 100x100 grid that covers the plotting area
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    # Overlay the instances: yellow circles for class 0, blue squares for class 1
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()``````

``````from sklearn.metrics import accuracy_score
# Test accuracy: (bagging ensemble, single decision tree)
accuracy_score(y_test, y_pred), accuracy_score(y_test, y_pred_tree)``````

``(0.904, 0.856)``
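The only difference between bagging and pasting is whether instances are drawn with or without replacement. The following is a minimal NumPy sketch of that idea, not Scikit-Learn's internal sampling code. The names `rng`, `m`, `n_draw`, `bagging_idx` and `pasting_idx` are introduced here for illustration, and `m = 375` assumes the default 75/25 split of the 500 samples used above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch
m = 375       # assumed training-set size (75% of the 500 samples)
n_draw = 100  # mirrors max_samples=100 in the code above

# Bagging: sample WITH replacement, so an index can appear more than once
bagging_idx = rng.choice(m, size=n_draw, replace=True)

# Pasting: sample WITHOUT replacement, so every index is distinct
pasting_idx = rng.choice(m, size=n_draw, replace=False)

print("distinct instances (bagging):", np.unique(bagging_idx).size)
print("distinct instances (pasting):", np.unique(pasting_idx).size)
```

Pasting always yields 100 distinct instances here, while bagging yields at most 100 and usually somewhat fewer, because some instances are drawn more than once.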

3 Out-of-Bag Evaluation

``````bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True, random_state=40)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_``````

``````0.8986666666666666
``````

``````y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)``````

``0.912``
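Besides the single `oob_score_` number, a fitted BaggingClassifier with `oob_score=True` also exposes `oob_decision_function_`, which holds the out-of-bag class-probability estimates for every training instance. The sketch below rebuilds the dataset so that it runs on its own. It omits `n_jobs` for simplicity, and the variable names `oob_proba` and `oob_pred` are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in section 1
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, oob_score=True, random_state=40)
bag_clf.fit(X_train, y_train)

# One row per training instance, one column per class
oob_proba = bag_clf.oob_decision_function_
print(oob_proba.shape)
print(oob_proba[:3])

# Picking the most probable class per row should reproduce the OOB accuracy
oob_pred = oob_proba.argmax(axis=1)
print(accuracy_score(y_train, oob_pred), bag_clf.oob_score_)
```

The last line should print the same value twice, which shows that the OOB score is simply the accuracy of these out-of-bag predictions on the training set.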

4 Q&A

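• Suppose you have trained several different models on exactly the same training set, and each of them reaches over 90% accuracy. Is it still worth combining them to get better results?

A: Yes. You can try combining them into a voting ensemble, which will often perform even better. The gain is larger when the models are very different from one another, or when they were trained on different training instances.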
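A small calculation shows why voting can help. It rests on a strong assumption that rarely holds in practice: the classifiers make their errors independently, on a binary problem. The function name `majority_vote_accuracy` and its parameters are hypothetical, introduced only for this sketch.

```python
from math import comb

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent classifiers is right."""
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct)**(n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

# Use an odd number of voters so that ties cannot occur
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(n, 0.90), 4))
```

Under this idealized assumption, three classifiers that are each 90% accurate reach about 97.2% with majority voting. Models trained on the same data make correlated errors, so real gains are smaller, which is why diversity among the models matters.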

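• What is the difference between hard voting and soft voting classifiers?

A: A hard voting classifier simply counts the votes of each classifier and picks the class with the most votes. A soft voting classifier computes the average estimated probability for each class and picks the class with the highest average. Soft voting often performs better because highly confident votes carry more weight, but it only works if every classifier can estimate class probabilities (for SVC in Scikit-Learn, this means setting probability=True).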
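The two rules can disagree on the same instance. The sketch below applies both to made-up probability estimates from three classifiers. The numbers in `probas` are invented for illustration and are not outputs of the models trained above.

```python
import numpy as np

# Hypothetical class-probability estimates for ONE instance:
# one row per classifier, one column per class (class 0, class 1)
probas = np.array([
    [0.55, 0.45],  # classifier A leans slightly towards class 0
    [0.60, 0.40],  # classifier B leans slightly towards class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most probable class
votes = probas.argmax(axis=1)
hard_pred = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities per class, then take the largest
soft_pred = probas.mean(axis=0).argmax()

print("hard:", hard_pred, "soft:", soft_pred)  # hard: 0 soft: 1
```

Hard voting picks class 0 by two votes to one, while soft voting picks class 1 because the averaged probabilities are 0.40 and 0.60: the single confident classifier outweighs the two hesitant ones.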

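• What is the benefit of out-of-bag evaluation?

A: With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated on instances it was not trained on. This gives a fairly unbiased evaluation of the ensemble without needing a separate validation set. As a result, more instances are available for training, and the ensemble's performance can improve slightly.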
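To see how many instances each predictor leaves out, the following sketch simulates bootstrap sampling with NumPy. It assumes a training set of 375 instances (the default 75/25 split of 500 samples) and that each predictor draws as many instances as the training set contains, which is what max_samples=1.0 means. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch
m = 375             # assumed training-set size
n_predictors = 500  # same ensemble size as in section 3

oob_fractions = []
for _ in range(n_predictors):
    # Bootstrap sample: m indices drawn with replacement
    sample_idx = rng.integers(0, m, size=m)
    # Mark every instance that was never drawn as out-of-bag
    oob_mask = np.ones(m, dtype=bool)
    oob_mask[sample_idx] = False
    oob_fractions.append(oob_mask.mean())

mean_oob = float(np.mean(oob_fractions))
print("average out-of-bag fraction:", round(mean_oob, 3))
print("theoretical value (1 - 1/m)**m:", round((1 - 1 / m) ** m, 3))
```

Both numbers come out close to 37%, so each predictor has roughly a third of the training set available as unseen data for evaluation.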

THE END
