The Python sklearn library is a rich machine learning library; it contains far too much to cover in full, so this post gives a brief overview of some operations commonly used in engineering work, to be extended as I use more of it.
1、LabelEncoder
Simply put, LabelEncoder assigns sequential integer codes to discontinuous numbers or text; it can be used to generate attribute/label encodings.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit([1, 3, 2, 6])
t = encoder.transform([1, 6, 6, 2])
print(t)
Output: [0 3 3 1]
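The learned mapping can be inspected and reversed; a small sketch reusing the encoder fitted above (classes_ and inverse_transform are standard LabelEncoder attributes):

print(encoder.classes_)                         # [1 2 3 6]: code i stands for classes_[i]
print(encoder.inverse_transform([0, 3, 3, 1]))  # recovers [1 6 6 2]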
2、OneHotEncoder
OneHotEncoder expands categorical data into extra dimensions: it maps [[1],[2],[3],[4]] to one-hot vectors where positions 0, 1, 2, 3 are set to 1 respectively (you can experiment with higher-dimensional data yourself):
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()  # declare an encoder
onehot.fit([[1], [2], [3], [4]])
print(onehot.transform([[2], [3], [1], [4]]).toarray())
Output:
[[0. 1. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
This is analogous to keras.utils.to_categorical(y_train, num_classes) in Keras.
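For text categories, a common two-step pattern is LabelEncoder followed by OneHotEncoder; a minimal sketch with made-up labels (newer sklearn versions can also one-hot encode strings directly):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = ['red', 'green', 'blue', 'green']
int_labels = LabelEncoder().fit_transform(colors)                       # e.g. [2 1 0 1]
one_hot = OneHotEncoder().fit_transform(int_labels.reshape(-1, 1)).toarray()
print(one_hot)                                                          # one row per sample, one column per class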
3、sklearn.model_selection.train_test_split: randomly split training and test sets
General form:
train_test_split is a function commonly used in cross-validation; it randomly selects train data and test data from the samples in a given proportion. Its form is:
x_train, x_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.2, train_size=0.8, random_state=0)
Parameter explanation:
- train_data: the sample features to be split
- train_target: the sample labels/targets to be split
- test_size: the proportion of test samples; if an integer, it is the absolute number of samples
- train_size: the proportion of training samples (note: specifying either the test or the train proportion is enough)
- random_state: the random seed.
- Random seed: effectively an identifier of a particular sequence of random numbers, which guarantees the same random numbers when an experiment needs to be repeated. For example, if you always pass 1 with the other parameters unchanged, you get the same random split every time; if you pass 0 or leave it unset, the split differs on every run.
Random number generation depends on the seed, and the relationship between them follows these rules:
- Different seeds produce different random numbers; the same seed produces the same random numbers even across different instances, as the short check below shows.
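A tiny check of the seed rule (a minimal sketch; the data values are arbitrary):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
a1, b1 = train_test_split(data, test_size=0.3, random_state=1)
a2, b2 = train_test_split(data, test_size=0.3, random_state=1)
print(np.array_equal(a1, a2))   # True: same seed, same split
a3, b3 = train_test_split(data, test_size=0.3, random_state=2)
print(np.array_equal(a1, a3))   # usually False: different seed, different split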
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
train = iris.data
target = iris.target

# to help avoid overfitting, hold out a validation split: 20% of the data,
# with a fixed random seed (random_state)
train_x, test_x, train_y, test_y = train_test_split(train, target, test_size=0.2, random_state=0)
print(train_y.shape)
The returned data: train_x is the training data and train_y the training labels; test_x and test_y are the corresponding test data and labels.
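Not part of the original example, but often useful for classification with imbalanced classes: the stratify argument keeps the class proportions the same in both splits. A minimal sketch on the iris data loaded above:

import numpy as np
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    train, target, test_size=0.2, random_state=0, stratify=target)
print(np.bincount(train_y), np.bincount(test_y))  # class counts stay proportional, e.g. [40 40 40] [10 10 10]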
4、Pipeline
This section draws on the article "Using a pipeline to reapply training-set parameters to the test set".
Pipeline wraps and manages all the steps in a streamlined way, which makes it easy to reuse a fitted set of parameters on new data.
Pipeline can be used for:
- Modular feature transforms: only a little code is needed to apply new feature transforms to the training data.
- Automated grid search: once the candidate models and parameters are set up, the best model can be searched for and recorded automatically.
- Automated ensemble generation: periodically take the best k models and combine them into an ensemble.
The task is to classify the Breast Cancer Wisconsin dataset, which contains 569 samples: the first column is an id, the second column is the class (M = malignant, B = benign), and columns 3-32 are real-valued features.
We use a Pipeline to apply the following operations to the training and test sets:
- first standardize each column of the data with StandardScaler (a transformer),
- then compress the original 30-dimensional features down to 2 dimensions with PCA (a transformer),
- and finally fit a LogisticRegression model (an estimator).
- When constructing the Pipeline, pass a list of tuples: the first element of each tuple is a step name, and the second element is a sklearn transformer or estimator.
Note that every intermediate step must be a transformer, i.e. it must provide fit and transform methods (or fit_transform).
The last step is an estimator: it must have a fit method, but need not have a transform method.
Then train on the training set with pipeline.fit: pipe_lr.fit(x_train, y_train).
Then predict and score on the test set directly with pipeline.score: pipe_lr.score(x_test, y_test).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# requires an internet connection
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                 header=None)  # Breast Cancer Wisconsin dataset

x, y = df.values[:, 2:], df.values[:, 1]

encoder = LabelEncoder()
y = encoder.fit_transform(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(x_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(x_test, y_test))
A pipeline can also be used for feature selection, for example selecting features with SelectKBest and classifying with an SVM:
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
Of course, k-fold cross validation can also be applied:
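For example, the whole anova_svm pipeline can be scored with cross_val_score; a sketch, assuming a feature matrix X and labels y have been prepared beforehand:

from sklearn.model_selection import cross_val_score

# X, y: feature matrix and labels prepared beforehand (assumed)
scores = cross_val_score(anova_svm, X, y, cv=5)
print(scores.mean(), scores.std())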
How the Pipeline works:
When the pipeline's fit method runs, StandardScaler first executes fit and transform, the transformed data is passed to PCA, PCA likewise executes fit and transform, and the data is finally passed to LogisticRegression for training.
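At predict time, the intermediate steps only call transform (no refitting) before the final estimator predicts. A sketch of the equivalence, reusing the fitted pipe_lr from above (named_steps is the standard sklearn attribute for accessing individual steps):

z = pipe_lr.named_steps['sc'].transform(x_test)         # StandardScaler: transform only
z = pipe_lr.named_steps['pca'].transform(z)             # PCA: transform only
manual_pred = pipe_lr.named_steps['clf'].predict(z)     # LogisticRegression: predict
print((manual_pred == pipe_lr.predict(x_test)).all())   # True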
5、predict returns the predicted labels directly
predict_proba returns the predicted probability of each class for every sample, and each row sums to 1. If the training set has the two classes of the example below and the test set has three samples, predict returns a 3x1 vector while predict_proba returns a 3x2 matrix, as shown in the results.
# coding: utf-8
from sklearn.linear_model import LogisticRegression
import numpy as np

x_train = np.array([[1, 2, 3],
                    [1, 3, 4],
                    [2, 1, 2],
                    [4, 5, 6],
                    [3, 5, 3],
                    [1, 7, 2]])
y_train = np.array([3, 3, 3, 2, 2, 2])

x_test = np.array([[2, 2, 2],
                   [3, 2, 6],
                   [1, 7, 4]])

clf = LogisticRegression()
clf.fit(x_train, y_train)

# return the predicted labels
print(clf.predict(x_test))

# return the predicted probability of each label
print(clf.predict_proba(x_test))
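To interpret the columns of predict_proba, note that they follow the order of clf.classes_; a small check with the classifier trained above:

print(clf.classes_)                        # [2 3]: column 0 is class 2, column 1 is class 3
proba = clf.predict_proba(x_test)
print(clf.classes_[proba.argmax(axis=1)])  # same labels as clf.predict(x_test)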
6、Evaluation methods in sklearn.metrics
1. sklearn.metrics.roc_curve(true_y, pred_proba_score, pos_label)
Computes the ROC curve. A ROC curve has three components: fpr, tpr and the thresholds, so the function returns these three arrays.
2. sklearn.metrics.auc(x, y, reorder=False):
Computes the AUC value. x and y are arrays; the curve is defined by the points (xi, yi) and the area under it is computed.
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

y = np.array([1, 0, 2, 2])
pred = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, pred, pos_label=2)
print(tpr)
print(fpr)
print(thresholds)
print(auc(fpr, tpr))
3. sklearn.metrics.roc_auc_score(true_y, pred_proba_y)
Computes the AUC directly from the true labels (which must be binary) and the predicted values (either 0/1 or probability scores), skipping the intermediate ROC computation.
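A minimal sketch with binary 0/1 labels (the values are made up):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # 0.75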
7、GridSearchCV
GridSearchCV exists for automatic hyper-parameter tuning: feed in the candidate parameters and it returns the best result together with the corresponding parameters. It works well on small datasets, but once the data volume grows it becomes very hard to get results in reasonable time, and a different strategy is needed. For larger data a fast tuning method is coordinate descent, essentially a greedy algorithm: tune the parameter that currently influences the model most until it is optimal, then tune the next most influential parameter, and so on until all parameters have been tuned. Its drawback is that it may end up in a local rather than the global optimum, but it saves a lot of time and effort, which is a big enough advantage to make it worth trying; the result can later be improved further, for example with bagging.
Back to GridSearchCV in sklearn: it systematically iterates over multiple parameter combinations and determines the best-performing parameters via cross-validation.
Official sklearn page for GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True)
Commonly used parameters:
estimator: the estimator to use, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10), with every parameter fixed except the ones to be searched. Each estimator needs either a scoring parameter or a score method.
param_grid: a dict or a list of dicts giving the candidate values of the parameters to optimise, e.g. param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
scoring: the evaluation metric. Default None, in which case the estimator's own score function is used; it can also be something like scoring='roc_auc' (the appropriate metric depends on the model). It may be a string (metric name) or a callable with the signature scorer(estimator, X, y).
cv: the cross-validation setting. Default None, which means 3-fold cross-validation; it can also be an integer number of folds or a generator yielding train/test splits.
refit: default True. After the search, the best parameters found by cross-validation are used to refit the estimator on the full training (and development) data, and that refitted model is the one used for final evaluation.
iid: default True. When True, the samples are assumed to be identically distributed across folds, and the loss is the sum over all samples rather than the average over folds.
verbose: log verbosity. 0: no training output; 1: occasional output; >1: output for every sub-model.
n_jobs: number of parallel jobs; -1 uses all CPU cores; 1 is the default.
pre_dispatch: the total number of jobs dispatched in parallel. When n_jobs > 1 the data are copied for each dispatch point, which can cause out-of-memory errors; setting pre_dispatch caps the number of jobs dispatched up front, so the data are copied at most pre_dispatch times.
Commonly used methods and attributes:
grid.fit(): run the grid search
grid_scores_: the evaluation results for the different parameter settings (replaced by cv_results_ in newer sklearn versions)
best_params_: the parameter combination that achieved the best result
best_score_: the best score observed during the search
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
import numpy as np

model = Lasso()
alpha_can = np.logspace(-3, 2, 10)
np.set_printoptions(suppress=True)  # print options
print("alpha_can =", alpha_can)

# cv: cross-validation parameter, default None; here 5-fold
# param_grid: a dict or list giving the candidate values of the parameter to optimise
lasso_model = GridSearchCV(model, param_grid={'alpha': alpha_can}, cv=5)
# fit to find the best parameters (x_train, y_train: training data prepared beforehand)
lasso_model.fit(x_train, y_train)
print('Hyper-parameters:\n', lasso_model.best_params_)
print('Estimator:\n', lasso_model.best_estimator_)
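After fitting, the GridSearchCV object behaves like a normal estimator, so it can score and predict with the refitted best model; a sketch continuing the Lasso example (x_test is assumed to be a held-out set):

print(lasso_model.best_score_)        # best mean cross-validation score
y_pred = lasso_model.predict(x_test)  # predictions from the refitted best model (refit=True)
# per-candidate details are available in lasso_model.cv_results_ (a dict of arrays)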
If transforms are involved, use a Pipeline to simplify the setup and chain the transforms with the final classifier (a pipeline of transforms with a final estimator):
# combined_features: a FeatureUnion of a PCA ('pca') and a SelectKBest ('univ_select')
# defined beforehand; svm: an SVC instance (as in the sklearn FeatureUnion example)
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(x, y)
print(grid_search.best_estimator_)
8、StandardScaler
Purpose: remove the mean and scale to unit variance. This is done per feature dimension, not per sample.
Note: standardization does not benefit every estimator.
# coding=utf-8
# compute the mean and std statistics of the training data
from sklearn.preprocessing import StandardScaler
import numpy as np


def test_algorithm():
    np.random.seed(123)
    print('use StandardScaler')
    # note: shape of data is [n_samples, n_features]
    data = np.random.randn(3, 4)
    scaler = StandardScaler()
    scaler.fit(data)
    trans_data = scaler.transform(data)
    print('original data: ')
    print(data)
    print('transformed data: ')
    print(trans_data)
    print('scaler info: scaler.mean_: {}, scaler.var_: {}'.format(scaler.mean_, scaler.var_))
    print('\n')

    print('use numpy by self')
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    var = std * std
    print('mean: {}, std: {}, var: {}'.format(mean, std, var))
    # numpy broadcasting
    another_trans_data = data - mean
    # note: divide by the standard deviation
    another_trans_data = another_trans_data / std
    print('another_trans_data: ')
    print(another_trans_data)


if __name__ == '__main__':
    test_algorithm()
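In practice (echoing the pipeline discussion in section 4), the scaler is usually fitted on the training split only and its mean/std reused on the test split. A hedged sketch with made-up arrays:

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.random.randn(100, 4)
x_test = np.random.randn(20, 4)

scaler = StandardScaler().fit(x_train)  # statistics come from the training set only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)   # the test set reuses the training statistics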
9、PolynomialFeatures
sklearn.preprocessing.PolynomialFeatures constructs new features from existing ones.
It does so with polynomials: given two features a and b, the degree-2 polynomial features are (1, a, b, a^2, ab, b^2).
PolynomialFeatures has three parameters:
degree: controls the degree of the polynomial.
interaction_only: default False; if True, features are never combined with themselves, so the quadratic terms above would not include a^2 and b^2.
include_bias: default True; if True, the constant 1 term above is included.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

path = r"activity_recognizer\1.csv"
# data from https://archive.ics.uci.edu/ml/datasets/activity+recognition+from+single+chest-mounted+accelerometer
df = pd.read_csv(path, header=None)
df.columns = ['index', 'x', 'y', 'z', 'activity']

knn = KNeighborsClassifier()
knn_params = {'n_neighbors': [3, 4, 5, 6]}

x = df[['x', 'y', 'z']]
y = df['activity']

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
x_poly = poly.fit_transform(x)
x_poly_df = pd.DataFrame(x_poly, columns=poly.get_feature_names())
print(x_poly_df.head())
Output:
x0 x1 x2 x0^2 x0 x1 x0 x2 x1^2 \
0 1502.0 2215.0 2153.0 2256004.0 3326930.0 3233806.0 4906225.0
1 1667.0 2072.0 2047.0 2778889.0 3454024.0 3412349.0 4293184.0
2 1611.0 1957.0 1906.0 2595321.0 3152727.0 3070566.0 3829849.0
3 1601.0 1939.0 1831.0 2563201.0 3104339.0 2931431.0 3759721.0
4 1643.0 1965.0 1879.0 2699449.0 3228495.0 3087197.0 3861225.0
x1 x2 x2^2
0 4768895.0 4635409.0
1 4241384.0 4190209.0
2 3730042.0 3632836.0
3 3550309.0 3352561.0
4 3692235.0 3530641.0
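A small check of the degree / interaction_only / include_bias behaviour described above, on a made-up two-feature sample:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # a = 2, b = 3
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X))
# [[2. 3. 4. 6. 9.]]  -> a, b, a^2, ab, b^2
print(PolynomialFeatures(degree=2, include_bias=False, interaction_only=True).fit_transform(X))
# [[2. 3. 6.]]        -> a, b, ab (no a^2 or b^2)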
10、Comparison of 10+ machine learning algorithms
sklearn API: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
10.1 Generating the data
import numpy as np
np.random.seed(10)
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import make_classification, make_moons, make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, accuracy_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.calibration import calibration_curve
import copy

print(__doc__)

from matplotlib.colors import ListedColormap
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# data
x, y = make_classification(n_samples=100000)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4000)
# split the training set again: part of it is held out for training the downstream LR
x_train, x_train_lr, y_train, y_train_lr = train_test_split(x_train, y_train, test_size=0.2, random_state=4000)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)


def ylabel(y_pred):
    y_pred_f = copy.copy(y_pred)
    y_pred_f[y_pred_f >= 0.5] = 1
    y_pred_f[y_pred_f < 0.5] = 0
    return y_pred_f


def acc_recall(y_test, y_pred_rf):
    return {'accuracy': accuracy_score(y_test, ylabel(y_pred_rf)),
            'recall': recall_score(y_test, ylabel(y_pred_rf))}
10.2 Eight mainstream machine learning models
h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Neural Net", "AdaBoost", "Naive Bayes", "QDA"]
# "Gaussian Process" is left out: it is far too slow (more than 300x the others)

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    # RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

predicteight = {}
for name, clf in zip(names, classifiers):
    predicteight[name] = {}
    predicteight[name]['prob_pos'], predicteight[name]['fpr_tpr'], predicteight[name]['acc_recall'] = [], [], []
    predicteight[name]['importance'] = []
    print('\n --- start model : %s ----\n' % name)
    %time clf.fit(x_train, y_train)

    # models with a decision boundary expose decision_function
    if hasattr(clf, "decision_function"):
        # the confidence score for a sample is the signed distance of that sample to the hyperplane
        %time prob_pos = clf.decision_function(x_test)
    else:
        %time prob_pos = clf.predict_proba(x_test)[:, 1]
    prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())  # normalize to [0, 1]
    predicteight[name]['prob_pos'] = prob_pos

    # compute ROC, accuracy, recall
    predicteight[name]['fpr_tpr'] = roc_curve(y_test, prob_pos)[:2]
    predicteight[name]['acc_recall'] = acc_recall(y_test, prob_pos)

    # extract feature information
    if hasattr(clf, "coef_"):
        predicteight[name]['importance'] = clf.coef_
    elif hasattr(clf, "feature_importances_"):
        predicteight[name]['importance'] = clf.feature_importances_
    elif hasattr(clf, "sigma_"):
        predicteight[name]['importance'] = clf.sigma_  # variance of each feature per class (Naive Bayes)
The output looks like:
Automatically created module for IPython interactive environment
--- start model : Nearest Neighbors ----
CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 103 ms
CPU times: user 2min 8s, sys: 3.43 ms, total: 2min 8s
Wall time: 2min 9s
--- start model : Linear SVM ----
CPU times: user 25.4 s, sys: 149 ms, total: 25.6 s
Wall time: 25.6 s
CPU times: user 3.47 s, sys: 1.23 ms, total: 3.47 s
Wall time: 3.47 s
10.3 Tree models: random forest
Example source: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py
n_estimator = 100  # number of trees per ensemble; not defined in the original snippet, value assumed from the "100 trees" comment below

''' model 0 : LM (plain logistic regression) '''
print('LM: start fitting...')
lm = LogisticRegression()
%time lm.fit(x_train, y_train)
y_pred_lm = lm.predict_proba(x_test)[:, 1]
fpr_lm, tpr_lm, _ = roc_curve(y_test, y_pred_lm)
lm_ar = acc_recall(y_test, y_pred_lm)  # accuracy and recall

''' model 1 : RT + LM (unsupervised transform + logistic regression) '''
# unsupervised transformation based on totally random trees
print('RandomTreesEmbedding + LM: start fitting...')
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator, random_state=0)
# unsupervised transformation of the data into a high-dimensional sparse representation
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
%time pipeline.fit(x_train, y_train)
y_pred_rt = pipeline.predict_proba(x_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)
rt_lm_ar = acc_recall(y_test, y_pred_rt)  # accuracy and recall

''' model 2 : RF / RF + LM '''
print('\nRandom forest models: start fitting...')
# supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(x_train, y_train)
rf_enc.fit(rf.apply(x_train))  # shapes from one run: rf.apply(x_train) -> (1310, 100), x_train -> (1310, 20)
# use the leaf indices of the 100 trees as features for the LM
%time rf_lm.fit(rf_enc.transform(rf.apply(x_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(x_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)
rf_lm_ar = acc_recall(y_test, y_pred_rf_lm)  # accuracy and recall

''' model 3 : GBT / GBT + LM '''
print('\nGradient boosting models: start fitting...')
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(x_train, y_train)
grd_enc.fit(grd.apply(x_train)[:, :, 0])
%time grd_lm.fit(grd_enc.transform(grd.apply(x_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(x_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
grd_lm_ar = acc_recall(y_test, y_pred_grd_lm)  # accuracy and recall

# the gradient boosted model by itself
y_pred_grd = grd.predict_proba(x_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)
grd_ar = acc_recall(y_test, y_pred_grd)  # accuracy and recall

# the random forest model by itself
y_pred_rf = rf.predict_proba(x_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
rf_ar = acc_recall(y_test, y_pred_rf)  # accuracy and recall
The output is:
LM: start fitting...
RandomTreesEmbedding + LM: start fitting...
CPU times: user 591 ms, sys: 85.5 ms, total: 677 ms
Wall time: 574 ms
Random forest models: start fitting...
CPU times: user 76 ms, sys: 0 ns, total: 76 ms
Wall time: 76 ms
Gradient boosting models: start fitting...
CPU times: user 60.6 ms, sys: 0 ns, total: 60.6 ms
Wall time: 60.6 ms
10.4 Results: accuracy and recall of each model
# the 8 standard models
for x, y in predicteight.items():
    print('\n ----- the model : %s , -----\n ' % (x))
    print(predicteight[x]['acc_recall'])

# tree-based models
names = ['LM', 'LM + RT', 'LM + RF', 'GBT + LM', 'GBT', 'RF']
ar_list = [lm_ar, rt_lm_ar, rf_lm_ar, grd_lm_ar, grd_ar, rf_ar]
for x, y in zip(names, ar_list):
    print('\n --- accuracy and recall of %s: ---- \n ' % x, y)
Output:
----- the model : Linear SVM , -----
{'recall': 0.84561049445005043, 'accuracy': 0.89100000000000001}
----- the model : Decision Tree , -----
{'recall': 0.90918264379414737, 'accuracy': 0.89949999999999997}
----- the model : AdaBoost , -----
{'recall': 0.028254288597376387, 'accuracy': 0.51800000000000002}
----- the model : Neural Net , -----
{'recall': 0.91523713420787078, 'accuracy': 0.90249999999999997}
----- the model : Naive Bayes , -----
{'recall': 0.91523713420787078, 'accuracy': 0.89300000000000002}
10.5 Results: calibration curves
Calibration curves may also be referred to as reliability diagrams.
They are a way of checking how reliable the predicted probabilities are.
# #############################################################################
# plot calibration plots
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Neural Net", "AdaBoost", "Naive Bayes", "QDA"]

plt.figure(figsize=(15, 15))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")

for prob_pos, name in [[predicteight[n]['prob_pos'], n] for n in names] + \
                      [(y_pred_lm, 'LM'),
                       (y_pred_rt, 'RT + LM'),
                       (y_pred_rf_lm, 'RF + LM'),
                       (y_pred_grd_lm, 'GBT + LM'),
                       (y_pred_grd, 'GBT'),
                       (y_pred_rf, 'RF')]:
    prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_pos, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label="%s" % (name,))
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name, histtype="step", lw=2)

ax1.set_ylabel("fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('calibration plots (reliability curve)')

ax2.set_xlabel("mean predicted value")
ax2.set_ylabel("count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()
The resulting figure has two panels.
First panel: fraction_of_positives, the fraction of positive samples (positives / total) in each predicted-probability bin, plotted against mean_predicted_value, the mean predicted probability of that bin.
Second panel: a histogram of the number of samples in each predicted-probability bin.
10.6 Results: feature importance output
Tree models can output feature importances, linear and regression models can output coefficients, and models with a decision surface (such as SVM) can compute the distance of each point to the decision boundary.
# feature importances
print('\n -------- RandomForest importances ------------\n')
print(rf.feature_importances_)
print('\n -------- GradientBoosting importances ------------\n')
print(grd.feature_importances_)
print('\n -------- Logistic coefficients ------------\n')
lm.coef_

# feature information of the other models
[[predicteight[n]['importance'], n] for n in names if predicteight[n]['importance'] != []]
In this comparison of 10+ models, the models that can output feature importances are:
- random forest: rf.feature_importances_
- GBT: grd.feature_importances_
- decision tree: decision.feature_importances_
- AdaBoost: adaboost.feature_importances_
The models with coefficients are the linear models (lm.coef_) and the linear SVM (svm.coef_).
Naive Bayes provides naivebayes.sigma_, interpreted as the variance of each feature per class.
10.7 Computing and plotting ROC curves
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lm, tpr_lm, label='LR')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')

# the 8 standard models
for (fpr, tpr), name in [[predicteight[n]['fpr_tpr'], n] for n in names]:
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.4, 1)  # zoom in on the top-left corner
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lm, tpr_lm, label='LR')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')

for (fpr, tpr), name in [[predicteight[n]['fpr_tpr'], n] for n in names]:
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()
That concludes this overview of how to use the Python sklearn library.
Original article: https://blog.csdn.net/qq_29750461/article/details/81559848