The Python sklearn library is a rich machine learning library; it contains far too much to cover in full, so this post gives a brief overview of some operations commonly used in engineering work, to be extended as I use more of it.
1、LabelEncoder
Simply put, LabelEncoder assigns sequential integer codes to discontinuous numbers or text; it can be used to generate attribute/label encodings.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit([1, 3, 2, 6])
t = encoder.transform([1, 6, 6, 2])
print(t)
Output: [0 3 3 1]
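The learned mapping can be inspected and reversed; a small sketch reusing the encoder fitted above (classes_ and inverse_transform are standard LabelEncoder attributes):

print(encoder.classes_)                         # [1 2 3 6]: code i stands for classes_[i]
print(encoder.inverse_transform([0, 3, 3, 1]))  # recovers [1 6 6 2]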
2、OneHotEncoder
OneHotEncoder expands categorical data into extra dimensions: it maps [[1],[2],[3],[4]] to one-hot vectors where positions 0, 1, 2, 3 are set to 1 respectively (you can experiment with higher-dimensional data yourself):
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()  # declare an encoder
onehot.fit([[1], [2], [3], [4]])
print(onehot.transform([[2], [3], [1], [4]]).toarray())
Output:
[[0. 1. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
This is analogous to keras.utils.to_categorical(y_train, num_classes) in Keras.
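For text categories, a common two-step pattern is LabelEncoder followed by OneHotEncoder; a minimal sketch with made-up labels (newer sklearn versions can also one-hot encode strings directly):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = ['red', 'green', 'blue', 'green']
int_labels = LabelEncoder().fit_transform(colors)                       # e.g. [2 1 0 1]
one_hot = OneHotEncoder().fit_transform(int_labels.reshape(-1, 1)).toarray()
print(one_hot)                                                          # one row per sample, one column per class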
3、sklearn.model_selection.train_test_split: randomly split training and test sets
General form:
train_test_split is a function commonly used in cross-validation; it randomly selects train data and test data from the samples in a given proportion. Its form is:
x_train, x_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.2, train_size=0.8, random_state=0)
Parameter explanation:
- train_data: the sample features to be split
- train_target: the sample labels/targets to be split
- test_size: the proportion of test samples; if an integer, it is the absolute number of samples
- train_size: the proportion of training samples (note: specifying either the test or the train proportion is enough)
- random_state: the random seed.
- Random seed: effectively an identifier of a particular sequence of random numbers, which guarantees the same random numbers when an experiment needs to be repeated. For example, if you always pass 1 with the other parameters unchanged, you get the same random split every time; if you pass 0 or leave it unset, the split differs on every run.
Random number generation depends on the seed, and the relationship between them follows these rules:
- Different seeds produce different random numbers; the same seed produces the same random numbers even across different instances, as the short check below shows.
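A tiny check of the seed rule (a minimal sketch; the data values are arbitrary):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
a1, b1 = train_test_split(data, test_size=0.3, random_state=1)
a2, b2 = train_test_split(data, test_size=0.3, random_state=1)
print(np.array_equal(a1, a2))   # True: same seed, same split
a3, b3 = train_test_split(data, test_size=0.3, random_state=2)
print(np.array_equal(a1, a3))   # usually False: different seed, different split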
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
train = iris.data
target = iris.target

# to help avoid overfitting, hold out a validation split: 20% of the data,
# with a fixed random seed (random_state)
train_x, test_x, train_y, test_y = train_test_split(train, target, test_size=0.2, random_state=0)
print(train_y.shape)
The returned data: train_x is the training data and train_y the training labels; test_x and test_y are the corresponding test data and labels.
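Not part of the original example, but often useful for classification with imbalanced classes: the stratify argument keeps the class proportions the same in both splits. A minimal sketch on the iris data loaded above:

import numpy as np
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    train, target, test_size=0.2, random_state=0, stratify=target)
print(np.bincount(train_y), np.bincount(test_y))  # class counts stay proportional, e.g. [40 40 40] [10 10 10]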
4、Pipeline
This section draws on the article "Using a pipeline to reapply training-set parameters to the test set".
Pipeline wraps and manages all the steps in a streamlined way, which makes it easy to reuse a fitted set of parameters on new data.
Pipeline can be used for:
- Modular feature transforms: only a little code is needed to apply new feature transforms to the training data.
- Automated grid search: once the candidate models and parameters are set up, the best model can be searched for and recorded automatically.
- Automated ensemble generation: periodically take the best k models and combine them into an ensemble.
The task is to classify the Breast Cancer Wisconsin dataset, which contains 569 samples: the first column is an id, the second column is the class (M = malignant, B = benign), and columns 3-32 are real-valued features.
We use a Pipeline to apply the following operations to the training and test sets:
- first standardize each column of the data with StandardScaler (a transformer),
- then compress the original 30-dimensional features down to 2 dimensions with PCA (a transformer),
- and finally fit a LogisticRegression model (an estimator).
- When constructing the Pipeline, pass a list of tuples: the first element of each tuple is a step name, and the second element is a sklearn transformer or estimator.
Note that every intermediate step must be a transformer, i.e. it must provide fit and transform methods (or fit_transform).
The last step is an estimator: it must have a fit method, but need not have a transform method.
Then train on the training set with pipeline.fit: pipe_lr.fit(x_train, y_train).
Then predict and score on the test set directly with pipeline.score: pipe_lr.score(x_test, y_test).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# requires an internet connection
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                 header=None)  # Breast Cancer Wisconsin dataset

x, y = df.values[:, 2:], df.values[:, 1]

encoder = LabelEncoder()
y = encoder.fit_transform(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(x_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(x_test, y_test))
A pipeline can also be used for feature selection, for example selecting features with SelectKBest and classifying with an SVM:
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
Of course, k-fold cross validation can also be applied:
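For example, the whole anova_svm pipeline can be scored with cross_val_score; a sketch, assuming a feature matrix X and labels y have been prepared beforehand:

from sklearn.model_selection import cross_val_score

# X, y: feature matrix and labels prepared beforehand (assumed)
scores = cross_val_score(anova_svm, X, y, cv=5)
print(scores.mean(), scores.std())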
How the Pipeline works:
When the pipeline's fit method runs, StandardScaler first executes fit and transform, the transformed data is passed to PCA, PCA likewise executes fit and transform, and the data is finally passed to LogisticRegression for training.
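At predict time, the intermediate steps only call transform (no refitting) before the final estimator predicts. A sketch of the equivalence, reusing the fitted pipe_lr from above (named_steps is the standard sklearn attribute for accessing individual steps):

z = pipe_lr.named_steps['sc'].transform(x_test)         # StandardScaler: transform only
z = pipe_lr.named_steps['pca'].transform(z)             # PCA: transform only
manual_pred = pipe_lr.named_steps['clf'].predict(z)     # LogisticRegression: predict
print((manual_pred == pipe_lr.predict(x_test)).all())   # True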
5、predict returns the predicted labels directly
predict_proba returns the predicted probability of each class for every sample, and each row sums to 1. If the training set has the two classes of the example below and the test set has three samples, predict returns a 3x1 vector while predict_proba returns a 3x2 matrix, as shown in the results.
# coding: utf-8
from sklearn.linear_model import LogisticRegression
import numpy as np

x_train = np.array([[1, 2, 3],
                    [1, 3, 4],
                    [2, 1, 2],
                    [4, 5, 6],
                    [3, 5, 3],
                    [1, 7, 2]])
y_train = np.array([3, 3, 3, 2, 2, 2])

x_test = np.array([[2, 2, 2],
                   [3, 2, 6],
                   [1, 7, 4]])

clf = LogisticRegression()
clf.fit(x_train, y_train)

# return the predicted labels
print(clf.predict(x_test))

# return the predicted probability of each label
print(clf.predict_proba(x_test))
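To interpret the columns of predict_proba, note that they follow the order of clf.classes_; a small check with the classifier trained above:

print(clf.classes_)                        # [2 3]: column 0 is class 2, column 1 is class 3
proba = clf.predict_proba(x_test)
print(clf.classes_[proba.argmax(axis=1)])  # same labels as clf.predict(x_test)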
6、Evaluation methods in sklearn.metrics
1. sklearn.metrics.roc_curve(true_y, pred_proba_score, pos_label)
Computes the ROC curve. A ROC curve has three components: fpr, tpr and the thresholds, so the function returns these three arrays.
2. sklearn.metrics.auc(x, y, reorder=False):
Computes the AUC value. x and y are arrays; the curve is defined by the points (xi, yi) and the area under it is computed.
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

y = np.array([1, 0, 2, 2])
pred = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, pred, pos_label=2)
print(tpr)
print(fpr)
print(thresholds)
print(auc(fpr, tpr))
3. sklearn.metrics.roc_auc_score(true_y, pred_proba_y)
Computes the AUC directly from the true labels (which must be binary) and the predicted values (either 0/1 or probability scores), skipping the intermediate ROC computation.
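A minimal sketch with binary 0/1 labels (the values are made up):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # 0.75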
7、GridSearchCV
GridSearchCV exists for automatic hyper-parameter tuning: feed in the candidate parameters and it returns the best result together with the corresponding parameters. It works well on small datasets, but once the data volume grows it becomes very hard to get results in reasonable time, and a different strategy is needed. For larger data a fast tuning method is coordinate descent, essentially a greedy algorithm: tune the parameter that currently influences the model most until it is optimal, then tune the next most influential parameter, and so on until all parameters have been tuned. Its drawback is that it may end up in a local rather than the global optimum, but it saves a lot of time and effort, which is a big enough advantage to make it worth trying; the result can later be improved further, for example with bagging.
Back to GridSearchCV in sklearn: it systematically iterates over multiple parameter combinations and determines the best-performing parameters via cross-validation.
Official sklearn page for GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True)
Commonly used parameters:
estimator: the estimator to use, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10), with every parameter fixed except the ones to be searched. Each estimator needs either a scoring parameter or a score method.
param_grid: a dict or a list of dicts giving the candidate values of the parameters to optimise, e.g. param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
scoring: the evaluation metric. Default None, in which case the estimator's own score function is used; it can also be something like scoring='roc_auc' (the appropriate metric depends on the model). It may be a string (metric name) or a callable with the signature scorer(estimator, X, y).
cv: the cross-validation setting. Default None, which means 3-fold cross-validation; it can also be an integer number of folds or a generator yielding train/test splits.
refit: default True. After the search, the best parameters found by cross-validation are used to refit the estimator on the full training (and development) data, and that refitted model is the one used for final evaluation.
iid: default True. When True, the samples are assumed to be identically distributed across folds, and the loss is the sum over all samples rather than the average over folds.
verbose: log verbosity. 0: no training output; 1: occasional output; >1: output for every sub-model.
n_jobs: number of parallel jobs; -1 uses all CPU cores; 1 is the default.
pre_dispatch: the total number of jobs dispatched in parallel. When n_jobs > 1 the data are copied for each dispatch point, which can cause out-of-memory errors; setting pre_dispatch caps the number of jobs dispatched up front, so the data are copied at most pre_dispatch times.
Commonly used methods and attributes:
grid.fit(): run the grid search
grid_scores_: the evaluation results for the different parameter settings (replaced by cv_results_ in newer sklearn versions)
best_params_: the parameter combination that achieved the best result
best_score_: the best score observed during the search
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
import numpy as np

model = Lasso()
alpha_can = np.logspace(-3, 2, 10)
np.set_printoptions(suppress=True)  # print options
print("alpha_can =", alpha_can)

# cv: cross-validation parameter, default None; here 5-fold
# param_grid: a dict or list giving the candidate values of the parameter to optimise
lasso_model = GridSearchCV(model, param_grid={'alpha': alpha_can}, cv=5)
# fit to find the best parameters (x_train, y_train: training data prepared beforehand)
lasso_model.fit(x_train, y_train)
print('Hyper-parameters:\n', lasso_model.best_params_)
print('Estimator:\n', lasso_model.best_estimator_)
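After fitting, the GridSearchCV object behaves like a normal estimator, so it can score and predict with the refitted best model; a sketch continuing the Lasso example (x_test is assumed to be a held-out set):

print(lasso_model.best_score_)        # best mean cross-validation score
y_pred = lasso_model.predict(x_test)  # predictions from the refitted best model (refit=True)
# per-candidate details are available in lasso_model.cv_results_ (a dict of arrays)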
If transforms are involved, use a Pipeline to simplify the setup and chain the transforms with the final classifier (a pipeline of transforms with a final estimator):
# combined_features: a FeatureUnion of a PCA ('pca') and a SelectKBest ('univ_select')
# defined beforehand; svm: an SVC instance (as in the sklearn FeatureUnion example)
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(x, y)
print(grid_search.best_estimator_)
8、StandardScaler
Purpose: remove the mean and scale to unit variance. This is done per feature dimension, not per sample.
Note: standardization does not benefit every estimator.
# coding=utf-8
# compute the mean and std statistics of the training data
from sklearn.preprocessing import StandardScaler
import numpy as np


def test_algorithm():
    np.random.seed(123)
    print('use StandardScaler')
    # note: shape of data is [n_samples, n_features]
    data = np.random.randn(3, 4)
    scaler = StandardScaler()
    scaler.fit(data)
    trans_data = scaler.transform(data)
    print('original data: ')
    print(data)
    print('transformed data: ')
    print(trans_data)
    print('scaler info: scaler.mean_: {}, scaler.var_: {}'.format(scaler.mean_, scaler.var_))
    print('\n')

    print('use numpy by self')
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    var = std * std
    print('mean: {}, std: {}, var: {}'.format(mean, std, var))
    # numpy broadcasting
    another_trans_data = data - mean
    # note: divide by the standard deviation
    another_trans_data = another_trans_data / std
    print('another_trans_data: ')
    print(another_trans_data)


if __name__ == '__main__':
    test_algorithm()
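In practice (echoing the pipeline discussion in section 4), the scaler is usually fitted on the training split only and its mean/std reused on the test split. A hedged sketch with made-up arrays:

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.random.randn(100, 4)
x_test = np.random.randn(20, 4)

scaler = StandardScaler().fit(x_train)  # statistics come from the training set only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)   # the test set reuses the training statistics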
9、PolynomialFeatures
sklearn.preprocessing.PolynomialFeatures constructs new features from existing ones.
It does so with polynomials: given two features a and b, the degree-2 polynomial features are (1, a, b, a^2, ab, b^2).
PolynomialFeatures has three parameters:
degree: controls the degree of the polynomial.
interaction_only: default False; if True, features are never combined with themselves, so the quadratic terms above would not include a^2 and b^2.
include_bias: default True; if True, the constant 1 term above is included.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

path = r"activity_recognizer\1.csv"
# data from https://archive.ics.uci.edu/ml/datasets/activity+recognition+from+single+chest-mounted+accelerometer
df = pd.read_csv(path, header=None)
df.columns = ['index', 'x', 'y', 'z', 'activity']

knn = KNeighborsClassifier()
knn_params = {'n_neighbors': [3, 4, 5, 6]}

x = df[['x', 'y', 'z']]
y = df['activity']

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
x_poly = poly.fit_transform(x)
x_poly_df = pd.DataFrame(x_poly, columns=poly.get_feature_names())
print(x_poly_df.head())
Output:
x0 x1 x2 x0^2 x0 x1 x0 x2 x1^2 \
0 1502.0 2215.0 2153.0 2256004.0 3326930.0 3233806.0 4906225.0
1 1667.0 2072.0 2047.0 2778889.0 3454024.0 3412349.0 4293184.0
2 1611.0 1957.0 1906.0 2595321.0 3152727.0 3070566.0 3829849.0
3 1601.0 1939.0 1831.0 2563201.0 3104339.0 2931431.0 3759721.0
4 1643.0 1965.0 1879.0 2699449.0 3228495.0 3087197.0 3861225.0
x1 x2 x2^2
0 4768895.0 4635409.0
1 4241384.0 4190209.0
2 3730042.0 3632836.0
3 3550309.0 3352561.0
4 3692235.0 3530641.0
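A small check of the degree / interaction_only / include_bias behaviour described above, on a made-up two-feature sample:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # a = 2, b = 3
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X))
# [[2. 3. 4. 6. 9.]]  -> a, b, a^2, ab, b^2
print(PolynomialFeatures(degree=2, include_bias=False, interaction_only=True).fit_transform(X))
# [[2. 3. 6.]]        -> a, b, ab (no a^2 or b^2)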
10、Comparison of 10+ machine learning algorithms
sklearn API: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
10.1 Generating the data
import numpy as np
np.random.seed(10)
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import make_classification, make_moons, make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, accuracy_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.calibration import calibration_curve
import copy

print(__doc__)

from matplotlib.colors import ListedColormap
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# data
x, y = make_classification(n_samples=100000)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4000)
# split the training set again: part of it is held out for training the downstream LR
x_train, x_train_lr, y_train, y_train_lr = train_test_split(x_train, y_train, test_size=0.2, random_state=4000)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)


def ylabel(y_pred):
    y_pred_f = copy.copy(y_pred)
    y_pred_f[y_pred_f >= 0.5] = 1
    y_pred_f[y_pred_f < 0.5] = 0
    return y_pred_f


def acc_recall(y_test, y_pred_rf):
    return {'accuracy': accuracy_score(y_test, ylabel(y_pred_rf)),
            'recall': recall_score(y_test, ylabel(y_pred_rf))}
10.2 Eight mainstream machine learning models
h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Neural Net", "AdaBoost", "Naive Bayes", "QDA"]
# "Gaussian Process" is left out: it is far too slow (more than 300x the others)

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    # RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

predicteight = {}
for name, clf in zip(names, classifiers):
    predicteight[name] = {}
    predicteight[name]['prob_pos'], predicteight[name]['fpr_tpr'], predicteight[name]['acc_recall'] = [], [], []
    predicteight[name]['importance'] = []
    print('\n --- start model : %s ----\n' % name)
    %time clf.fit(x_train, y_train)

    # models with a decision boundary expose decision_function
    if hasattr(clf, "decision_function"):
        # the confidence score for a sample is the signed distance of that sample to the hyperplane
        %time prob_pos = clf.decision_function(x_test)
    else:
        %time prob_pos = clf.predict_proba(x_test)[:, 1]
    prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())  # normalize to [0, 1]
    predicteight[name]['prob_pos'] = prob_pos

    # compute ROC, accuracy, recall
    predicteight[name]['fpr_tpr'] = roc_curve(y_test, prob_pos)[:2]
    predicteight[name]['acc_recall'] = acc_recall(y_test, prob_pos)

    # extract feature information
    if hasattr(clf, "coef_"):
        predicteight[name]['importance'] = clf.coef_
    elif hasattr(clf, "feature_importances_"):
        predicteight[name]['importance'] = clf.feature_importances_
    elif hasattr(clf, "sigma_"):
        predicteight[name]['importance'] = clf.sigma_  # variance of each feature per class (Naive Bayes)
The output looks like:
Automatically created module for IPython interactive environment
--- start model : Nearest Neighbors ----
CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 103 ms
CPU times: user 2min 8s, sys: 3.43 ms, total: 2min 8s
Wall time: 2min 9s
--- start model : Linear SVM ----
CPU times: user 25.4 s, sys: 149 ms, total: 25.6 s
Wall time: 25.6 s
CPU times: user 3.47 s, sys: 1.23 ms, total: 3.47 s
Wall time: 3.47 s
10.3 Tree models: random forest
Example source: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py
n_estimator = 100  # number of trees per ensemble; not defined in the original snippet, value assumed from the "100 trees" comment below

''' model 0 : LM (plain logistic regression) '''
print('LM: start fitting...')
lm = LogisticRegression()
%time lm.fit(x_train, y_train)
y_pred_lm = lm.predict_proba(x_test)[:, 1]
fpr_lm, tpr_lm, _ = roc_curve(y_test, y_pred_lm)
lm_ar = acc_recall(y_test, y_pred_lm)  # accuracy and recall

''' model 1 : RT + LM (unsupervised transform + logistic regression) '''
# unsupervised transformation based on totally random trees
print('RandomTreesEmbedding + LM: start fitting...')
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator, random_state=0)
# unsupervised transformation of the data into a high-dimensional sparse representation
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
%time pipeline.fit(x_train, y_train)
y_pred_rt = pipeline.predict_proba(x_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)
rt_lm_ar = acc_recall(y_test, y_pred_rt)  # accuracy and recall

''' model 2 : RF / RF + LM '''
print('\nRandom forest models: start fitting...')
# supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(x_train, y_train)
rf_enc.fit(rf.apply(x_train))  # shapes from one run: rf.apply(x_train) -> (1310, 100), x_train -> (1310, 20)
# use the leaf indices of the 100 trees as features for the LM
%time rf_lm.fit(rf_enc.transform(rf.apply(x_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(x_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)
rf_lm_ar = acc_recall(y_test, y_pred_rf_lm)  # accuracy and recall

''' model 3 : GBT / GBT + LM '''
print('\nGradient boosting models: start fitting...')
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(x_train, y_train)
grd_enc.fit(grd.apply(x_train)[:, :, 0])
%time grd_lm.fit(grd_enc.transform(grd.apply(x_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(x_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
grd_lm_ar = acc_recall(y_test, y_pred_grd_lm)  # accuracy and recall

# the gradient boosted model by itself
y_pred_grd = grd.predict_proba(x_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)
grd_ar = acc_recall(y_test, y_pred_grd)  # accuracy and recall

# the random forest model by itself
y_pred_rf = rf.predict_proba(x_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
rf_ar = acc_recall(y_test, y_pred_rf)  # accuracy and recall
The output is:
LM: start fitting...
RandomTreesEmbedding + LM: start fitting...
CPU times: user 591 ms, sys: 85.5 ms, total: 677 ms
Wall time: 574 ms
Random forest models: start fitting...
CPU times: user 76 ms, sys: 0 ns, total: 76 ms
Wall time: 76 ms
Gradient boosting models: start fitting...
CPU times: user 60.6 ms, sys: 0 ns, total: 60.6 ms
Wall time: 60.6 ms
10.4 Results: accuracy and recall of each model
# the 8 standard models
for x, y in predicteight.items():
    print('\n ----- the model : %s , -----\n ' % (x))
    print(predicteight[x]['acc_recall'])

# tree-based models
names = ['LM', 'LM + RT', 'LM + RF', 'GBT + LM', 'GBT', 'RF']
ar_list = [lm_ar, rt_lm_ar, rf_lm_ar, grd_lm_ar, grd_ar, rf_ar]
for x, y in zip(names, ar_list):
    print('\n --- accuracy and recall of %s: ---- \n ' % x, y)
Output:
----- the model : Linear SVM , -----
{'recall': 0.84561049445005043, 'accuracy': 0.89100000000000001}
----- the model : Decision Tree , -----
{'recall': 0.90918264379414737, 'accuracy': 0.89949999999999997}
----- the model : AdaBoost , -----
{'recall': 0.028254288597376387, 'accuracy': 0.51800000000000002}
----- the model : Neural Net , -----
{'recall': 0.91523713420787078, 'accuracy': 0.90249999999999997}
----- the model : Naive Bayes , -----
{'recall': 0.91523713420787078, 'accuracy': 0.89300000000000002}
10.5 Results: calibration curves
Calibration curves may also be referred to as reliability diagrams.
They are a way of checking how reliable the predicted probabilities are.
# #############################################################################
# plot calibration plots
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Neural Net", "AdaBoost", "Naive Bayes", "QDA"]

plt.figure(figsize=(15, 15))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")

for prob_pos, name in [[predicteight[n]['prob_pos'], n] for n in names] + \
                      [(y_pred_lm, 'LM'),
                       (y_pred_rt, 'RT + LM'),
                       (y_pred_rf_lm, 'RF + LM'),
                       (y_pred_grd_lm, 'GBT + LM'),
                       (y_pred_grd, 'GBT'),
                       (y_pred_rf, 'RF')]:
    prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_pos, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label="%s" % (name,))
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name, histtype="step", lw=2)

ax1.set_ylabel("fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('calibration plots (reliability curve)')

ax2.set_xlabel("mean predicted value")
ax2.set_ylabel("count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()
The resulting figure has two panels.
First panel: fraction_of_positives, the fraction of positive samples (positives / total) in each predicted-probability bin, plotted against mean_predicted_value, the mean predicted probability of that bin.
Second panel: a histogram of the number of samples in each predicted-probability bin.
10.6 Results: feature importance output
Tree models can output feature importances, linear and regression models can output coefficients, and models with a decision surface (such as SVM) can compute the distance of each point to the decision boundary.
# feature importances
print('\n -------- RandomForest importances ------------\n')
print(rf.feature_importances_)
print('\n -------- GradientBoosting importances ------------\n')
print(grd.feature_importances_)
print('\n -------- Logistic coefficients ------------\n')
lm.coef_

# feature information of the other models
[[predicteight[n]['importance'], n] for n in names if predicteight[n]['importance'] != []]
In this comparison of 10+ models, the models that can output feature importances are:
- random forest: rf.feature_importances_
- GBT: grd.feature_importances_
- decision tree: decision.feature_importances_
- AdaBoost: adaboost.feature_importances_
The models with coefficients are the linear models (lm.coef_) and the linear SVM (svm.coef_).
Naive Bayes provides naivebayes.sigma_, interpreted as the variance of each feature per class.
10.7 Computing and plotting ROC curves
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lm, tpr_lm, label='LR')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')

# the 8 standard models
for (fpr, tpr), name in [[predicteight[n]['fpr_tpr'], n] for n in names]:
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.4, 1)  # zoom in on the top-left corner
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lm, tpr_lm, label='LR')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')

for (fpr, tpr), name in [[predicteight[n]['fpr_tpr'], n] for n in names]:
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()
That concludes this overview of how to use the Python sklearn library.
Original article: https://blog.csdn.net/qq_29750461/article/details/81559848