
Author: Alexander Cheng. Contributors: 高璇, 思

# First Steps in Machine Learning: A Hands-On Introduction to Random Forests

1. Random forest
2. Random forest with PCA dimensionality reduction
3. Random forest with PCA dimensionality reduction and hyperparameter tuning

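• Use df.info() to see the data type and the number of non-null values in each column. You may need to convert data types.

• Use df.isna() to confirm there are no NaN values. You may need to handle missing values or drop rows.

• Use df.describe() to see the minimum, maximum, mean, median, standard deviation, and interquartile range of each column.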
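As a minimal sketch of the type conversion and missing-value handling mentioned above: the `toy` frame and its column values below are hypothetical and are not part of the tutorial's dataset. Dropping rows and filling with the median are shown as two possible options, not as the author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the tutorial's dataset) with one missing value.
toy = pd.DataFrame({
    "radius": [14.1, np.nan, 20.6],
    "texture": ["19.3", "17.8", "25.4"],  # numbers stored as strings
})

# Convert the string column to a numeric dtype.
toy["texture"] = pd.to_numeric(toy["texture"])

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains a NaN.
dropped = toy.dropna()

# Option 2: fill NaNs with the column median instead of dropping rows.
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(dropped)
print(filled)
```

Which option is appropriate depends on how much data is missing and why. Dropping rows is simplest but discards information.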

```python
import pandas as pd
from IPython.display import display  # display() is a notebook helper
from sklearn.datasets import load_breast_cancer

columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
           'mean smoothness', 'mean compactness', 'mean concavity',
           'mean concave points', 'mean symmetry', 'mean fractal dimension',
           'radius error', 'texture error', 'perimeter error', 'area error',
           'smoothness error', 'compactness error', 'concavity error',
           'concave points error', 'symmetry error', 'fractal dimension error',
           'worst radius', 'worst texture', 'worst perimeter', 'worst area',
           'worst smoothness', 'worst compactness', 'worst concavity',
           'worst concave points', 'worst symmetry', 'worst fractal dimension']

dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']

display(data.head())
data.info()  # info() prints its summary directly and returns None
display(data.isna().sum())
display(data.describe())
```

```python
from sklearn.model_selection import train_test_split

X = data.drop('cancer', axis=1)
y = data['cancer']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)
```
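The split above passes `stratify=y`. As a rough check of what that does, the sketch below reloads the same dataset and compares the share of label 1 in the full data, the training half, and the test half. The variable names ending in `_ratio` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y
)

# With stratify=y, each half keeps roughly the same class balance as the whole.
full_ratio = y.mean()         # share of label 1 in the full dataset
train_ratio = y_train.mean()  # share of label 1 in the training half
test_ratio = y_test.mean()    # share of label 1 in the test half

print(round(full_ratio, 3), round(train_ratio, 3), round(test_ratio, 3))
```

Without stratification the two halves could drift apart in class balance, which matters more as the dataset gets smaller.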

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
y_train = np.array(y_train)
```

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Baseline model. It is named rfc_1 because later steps refer to it by that name.
rfc_1 = RandomForestClassifier()
rfc_1.fit(X_train_scaled, y_train)
display(rfc_1.score(X_train_scaled, y_train))
# 1.0
```

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pair each feature name with its importance in the baseline model.
# `columns` holds the 30 feature names and excludes the 'cancer' target.
feats = {}
for feature, importance in zip(columns, rfc_1.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})

sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(30, 15)
sns.barplot(x=importances['Gini-Importance'], y=importances['Features'], data=importances, color='skyblue')
plt.xlabel('Importance', fontsize=25, weight='bold')
plt.ylabel('Features', fontsize=25, weight='bold')
plt.title('Feature Importance', fontsize=25, weight='bold')
plt.show()

display(importances)
```

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Fit PCA with all 30 components to see how explained variance accumulates.
pca_test = PCA(n_components=30)
pca_test.fit(X_train_scaled)

sns.set(style='whitegrid')
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axvline(linewidth=4, color='r', linestyle='--', x=10, ymin=0, ymax=1)
plt.show()

evr = pca_test.explained_variance_ratio_
cvr = np.cumsum(pca_test.explained_variance_ratio_)

pca_df = pd.DataFrame()
pca_df['Cumulative Variance Ratio'] = cvr
pca_df['Explained Variance Ratio'] = evr
display(pca_df.head(10))
```

```python
pca = PCA(n_components=10)
pca.fit(X_train_scaled)

X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)
```
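The number of components can also be derived from a variance threshold instead of being read off the plot. The sketch below is one way to do that; the 0.95 cutoff is an assumed value for illustration, and for brevity it fits on the full scaled dataset, whereas the tutorial fits PCA on the training split only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit all 30 components and accumulate the explained variance ratios.
pca_full = PCA(n_components=30).fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the cutoff.
threshold = 0.95  # assumed cutoff, not taken from the article
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Passing a float in (0, 1) lets PCA choose that number of components itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)

print(n_needed, pca_auto.n_components_)
```

Both routes should agree on the component count. The float form is convenient inside a pipeline, while the explicit cumulative sum makes the trade-off easier to inspect.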

```python
pca_dims = []
for x in range(0, len(pca_df)):
    pca_dims.append('PCA Component {}'.format(x))

pca_test_df = pd.DataFrame(pca_test.components_, columns=columns, index=pca_dims)
pca_test_df.head(10).T
```

Fitting a "baseline" random forest model after PCA

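```python
# Second baseline, trained on the PCA-reduced features.
# It is named rfc_2 because the hyperparameter searches refer to it by that name.
rfc_2 = RandomForestClassifier()
rfc_2.fit(X_train_scaled_pca, y_train)
display(rfc_2.score(X_train_scaled_pca, y_train))
# 1.0
```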

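• n_estimators: the number of trees in the random forest.

• max_features: the number of features considered at each split.

• max_depth: the maximum depth of each tree, that is, the longest chain of splits from the root to a leaf.

• min_samples_split: the minimum number of observations a node must contain before it can be split.

• min_samples_leaf: the minimum number of observations required at each leaf node.

• bootstrap: whether to use bootstrapping to supply data to each tree in the random forest. (Bootstrapping is random sampling with replacement from the dataset.)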
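To make "sampling with replacement" concrete, here is a small NumPy sketch. The sample size of 1000 and the seed are arbitrary choices for illustration. It only mimics the idea of drawing a bootstrap sample and does not reproduce scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Sampling with replacement: the same row can be drawn more than once.
bootstrap_sample = rng.choice(indices, size=n_samples, replace=True)

# Fraction of distinct rows that made it into the sample.
unique_fraction = np.unique(bootstrap_sample).size / n_samples

# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag = np.setdiff1d(indices, bootstrap_sample)

print(unique_fraction, out_of_bag.size)
```

For large samples the unique fraction tends toward 1 - 1/e, about 63%, so each tree sees a different subset of roughly two thirds of the rows. This is one source of the diversity between trees.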

```python
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for each hyperparameter.
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
min_samples_split = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
min_samples_leaf = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
bootstrap = [True, False]

param_dist = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

# Sample 100 random combinations and score each with 3-fold cross-validation.
rs = RandomizedSearchCV(rfc_2,
                        param_dist,
                        n_iter=100,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        random_state=0)

rs.fit(X_train_scaled_pca, y_train)
rs.best_params_

# Best parameters reported:
# {'n_estimators': 700,
#  'min_samples_split': 2,
#  'min_samples_leaf': 2,
#  'max_features': 'log2',
#  'max_depth': 11,
#  'bootstrap': True}
```
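A quick bit of arithmetic shows why a randomized search is used at this stage. The counts below are read off the candidate lists in the search above; the dictionary `candidates` and the derived variable names are introduced here only for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter in the randomized search.
candidates = {
    "n_estimators": 10,       # 100, 200, ..., 1000
    "max_features": 2,        # 'log2', 'sqrt'
    "max_depth": 15,          # 1 through 15
    "min_samples_split": 10,  # 10 values between 2 and 50
    "min_samples_leaf": 10,   # 10 values between 2 and 50
    "bootstrap": 2,           # True, False
}
n_iter, cv = 100, 3

total_combinations = prod(candidates.values())  # size of the full grid
random_search_fits = n_iter * cv                # cross-validation fits actually run
full_grid_fits = total_combinations * cv        # fits an exhaustive search would need
fraction_sampled = n_iter / total_combinations

print(total_combinations, random_search_fits, full_grid_fits)
print(f"{fraction_sampled:.4%} of the grid is sampled")
```

The randomized search therefore explores only a tiny fraction of the space, which is why its results are used to narrow the ranges rather than taken as final.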

```python
rs_df = pd.DataFrame(rs.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
rs_df = rs_df.drop([
    'mean_fit_time',
    'std_fit_time',
    'mean_score_time',
    'std_score_time',
    'params',
    'split0_test_score',
    'split1_test_score',
    'split2_test_score',
    'std_test_score'],
    axis=1)
rs_df.head(10)
```

```python
# Set the plot style before creating the axes so that it takes effect.
sns.set(style="whitegrid", color_codes=True, font_scale=2)
fig, axs = plt.subplots(ncols=3, nrows=2)
fig.set_size_inches(30, 25)

# One panel per hyperparameter: (results column, title, bar color, y-axis limits).
panels = [
    ('param_n_estimators', 'n_estimators', 'lightgrey', [.83, .93]),
    ('param_min_samples_split', 'min_samples_split', 'coral', [.85, .93]),
    ('param_min_samples_leaf', 'min_samples_leaf', 'lightgreen', [.80, .93]),
    ('param_max_features', 'max_features', 'wheat', [.88, .92]),
    ('param_max_depth', 'max_depth', 'lightpink', [.80, .93]),
    ('param_bootstrap', 'bootstrap', 'skyblue', [.88, .92]),
]

# axs.flat walks the 2x3 grid row by row, matching the order of `panels`.
for ax, (column, title, color, ylim) in zip(axs.flat, panels):
    sns.barplot(x=column, y='mean_test_score', data=rs_df, ax=ax, color=color)
    ax.set_ylim(ylim)
    ax.set_title(label=title, size=30, weight='bold')

plt.show()
```

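• n_estimators: 300, 500, and 700 have close to the highest mean scores.

• min_samples_split: smaller values such as 2 and 7 score higher, and 23 also scores well. We can try a few values just above 2 and a few values around 23.

• min_samples_leaf: smaller values appear to score higher, so we can try values between 2 and 7.

• max_features: "sqrt" has the highest mean score.

• max_depth: there is no clear pattern, but 2, 3, 7, 11, and 15 all perform well.

• bootstrap: "False" has the highest mean score.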

```python
from sklearn.model_selection import GridSearchCV

# Narrowed ranges based on the randomized search results.
n_estimators = [300, 500, 700]
max_features = ['sqrt']
max_depth = [2, 3, 7, 11, 15]
min_samples_split = [2, 3, 4, 22, 23, 24]
min_samples_leaf = [2, 3, 4, 5, 6, 7]
bootstrap = [False]

param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

# Exhaustively evaluate every combination with 3-fold cross-validation.
gs = GridSearchCV(rfc_2, param_grid, cv=3, verbose=1, n_jobs=-1)
gs.fit(X_train_scaled_pca, y_train)
rfc_3 = gs.best_estimator_
gs.best_params_

# Best parameters reported:
# {'bootstrap': False,
#  'max_depth': 7,
#  'max_features': 'sqrt',
#  'min_samples_leaf': 3,
#  'min_samples_split': 2,
#  'n_estimators': 500}
```
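Unlike the randomized search, the grid search evaluates every combination. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations in the grid above; `cv_fits` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [300, 500, 700],
    'max_features': ['sqrt'],
    'max_depth': [2, 3, 7, 11, 15],
    'min_samples_split': [2, 3, 4, 22, 23, 24],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7],
    'bootstrap': [False],
}
cv = 3

# ParameterGrid expands the dictionary into every combination of values.
n_combinations = len(ParameterGrid(param_grid))
cv_fits = n_combinations * cv  # one fit per combination per fold

print(n_combinations, cv_fits)
```

Counting the fits beforehand is a cheap way to decide whether a grid is small enough to run exhaustively or should be trimmed further.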

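• Baseline random forest

• Baseline random forest with PCA dimensionality reduction

• Random forest with PCA dimensionality reduction and hyperparameter tuning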

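```python
# Each model must predict on the same feature space it was trained on.
y_pred = rfc_1.predict(X_test_scaled)          # baseline, scaled features
y_pred_pca = rfc_2.predict(X_test_scaled_pca)  # baseline, PCA-reduced features
y_pred_gs = gs.best_estimator_.predict(X_test_scaled_pca)  # tuned, PCA-reduced features
```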

```python
from sklearn.metrics import confusion_matrix, recall_score

# Wrap each confusion matrix in a labelled DataFrame for readability.
conf_matrix_baseline = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])

conf_matrix_baseline_pca = pd.DataFrame(
    confusion_matrix(y_test, y_pred_pca),
    index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])

conf_matrix_tuned_pca = pd.DataFrame(
    confusion_matrix(y_test, y_pred_gs),
    index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])

display(conf_matrix_baseline)
display('Baseline Random Forest recall score', recall_score(y_test, y_pred))
display(conf_matrix_baseline_pca)
display('Baseline Random Forest With PCA recall score', recall_score(y_test, y_pred_pca))
display(conf_matrix_tuned_pca)
display('Hyperparameter Tuned Random Forest With PCA Reduced Dimensionality recall score', recall_score(y_test, y_pred_gs))
```
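The recall score reported above can be read directly off a confusion matrix as TP / (TP + FN). The sketch below checks this on a small set of made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical and are not results from the tutorial.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Recall: the share of actual positives that the model found.
manual_recall = tp / (tp + fn)
library_recall = recall_score(y_true_toy, y_pred_toy)

print(tn, fp, fn, tp)
print(manual_recall, library_recall)
```

Note that `recall_score` treats label 1 as the positive class by default, and in `load_breast_cancer` label 1 is "benign" and label 0 is "malignant". If recall on malignant cases is what matters, passing `pos_label=0` is one way to measure it.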

https://towardsdatascience.com/machine-learning-step-by-step-6fbde95c455a
