|
1. Random Forest Learning Algorithm
Random forest is a classification algorithm belonging to the Bagging (bootstrap aggregating) branch of ensemble learning. Because it does not concentrate on hard-to-classify samples the way boosting does, its performance on such samples can be limited. Before studying the random forest algorithm, three concepts need to be understood first: decision trees; ensemble learning (multi-classifier systems); and bootstrap sampling.
A random forest is a classifier composed of many decision trees, and its output class is the mode (majority vote) of the classes output by the individual trees. Random forests belong to a major branch of machine learning known as ensemble learning. Their strengths include: producing highly accurate classifiers on many kinds of data; handling a large number of input variables; assessing the importance of variables when deciding the class; producing an internal unbiased estimate of the generalization error; and balancing errors on imbalanced data sets.
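To make bootstrap sampling and majority voting concrete, here is a minimal MATLAB sketch of bagging using fitctree from the Statistics and Machine Learning Toolbox. It is an illustrative addition, not the book's code; X, y, Xtest and nTrees are assumed variables, and a true random forest would additionally sample only mtry candidate features at each split.
nTrees = 25;                                  % number of bagged trees (illustrative)
n = size(X, 1);                               % X: n-by-d samples, y: n-by-1 labels
trees = cell(nTrees, 1);
for t = 1 : nTrees
    idx = randi(n, n, 1);                     % bootstrap sample: n draws with replacement
    trees{t} = fitctree(X(idx, :), y(idx));   % train one decision tree per bootstrap sample
end
votes = zeros(size(Xtest, 1), nTrees);        % collect each tree's prediction
for t = 1 : nTrees
    votes(:, t) = predict(trees{t}, Xtest);
end
yhat = mode(votes, 2);                        % forest output = mode of the tree outputs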
2. MATLAB Simulation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Purpose: Demonstrate the random forest algorithm in computer vision
%Environment: Win7, Matlab2018a
%Modi: C.S
%Date: 2022-4-5
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% path = ['E:\works\book\7(机器学习20讲)\Code\5、Random Forest\'];
path = ['F:\Learning\线上\csdn\视觉机器学习20讲\5、Random Forest\'];
data1 = textread([path 'satimage.tra']);
data2 = textread([path 'satimage.txt']);
% path = 'C:\Users\Administrator\Documents\MATLAB\';
% data = textread(path + 'srbct.txt');
% In this data set, each row represents a sample,
% and each column represents a kind of variable(feature, attribute).
% !! So we should transpose "x" and "xts" below.
[m1, n1] = size(data1);
[m2, n2] = size(data2);
ntest = m2; % The number of test set;
ntrain = m1; % The number of training set;
% Here the .tra file supplies all training samples and the .txt file
% all test samples; no random splitting is performed.
x = (data1(1 : ntrain, 1 : n1 - 1));
x = x';
cl = (data1(1 : ntrain, n1));
xts = (data2(1 : ntest, 1 : n2 - 1));
xts = xts';
clts = (data2(1 : ntest, n2));
% Above lines acquire x, cl, xts and clts from data1 and data2;
nclass = 6;
% The satimage data set has 6 classes.
classwt = 0;
% Here we set all classes to the same weight, 1.
% It can also be written as "classwt = [1 1 1 1 1 1];".
cat0 = 0;
% Here we set it having no categorical variables.
runParam = [6 1 50 10 1 0];
% Here we set mtry = 6, ndsize = 1, jbt = 50, look = 10, lookcls = 1, mdim2nd = 0;
impOpt = [0 0 0];
% Here we set imp = 0, Interact = 0, impn = 0;
proCom = [0 0 0 0 0];
% Here we set nprox = 0, nrnn = 0, noutlier = 0, nscale = 0, nprot = 0;
missingVal = 0;
% Here we set missingVal = 0, that means we use the "Default Value" for missingVal.
% That is, code = -999.0, missingfill = 0;
saveForest = [0 0 0];
% Here we set isaverf = 0, isavepar = 0, isavefill = 0;
runForest = [0 0];
% Here we set irunrf = 0, ireadpar = 0;
outParam = [1,0,0,0,0,0,0,0,0,0];
% Here we set isumout = 1 to show a classification summary.
msm = 1 : 36;
% Here we use all 36 variables; we can also set msm = 0 to use all variables.
seed = 4351;
x = single(x); %get train x
cl = int32(cl); %get train label
xts = single(xts); %get test x
clts = int32(clts); %get test label
classwt = single(classwt);
cat0 = int32(cat0);
msm = int32(msm);
runParam = int32(runParam);
impOpt = int32(impOpt);
proCom = int32(proCom);
missingVal = single(missingVal);
saveForest = int32(saveForest);
runForest = int32(runForest);
outParam = int32(outParam);
seed = int32(seed);
[errtr, errts, prox, trees, predictts, varimp, scale] = ...
RF(nclass, x, cl, xts, clts, classwt, cat0, msm, runParam, impOpt, ...
proCom, missingVal, saveForest, runForest, outParam, seed, 'satimage');
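Assuming the wrapper's predictts output holds the predicted test-set labels (as the name suggests; this snippet is an added illustration, not part of the original listing), the test error and confusion matrix reported below can be recomputed directly:
yts = double(predictts(:));                    % predicted test labels (assumed meaning)
testErr = mean(yts ~= double(clts(:))) * 100;  % percent misclassified on the test set
fprintf('test error: %.2f%%\n', testErr);
confMat = accumarray([double(clts(:)) yts], 1, [nclass nclass]);
disp(confMat);                                 % rows = true class, columns = predicted class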
3. Simulation Results
>> main
* Class counts - training data
Class: 1 2 3 4 5 6
Counts: 1072 479 961 415 470 1038
* Class counts - test data
Class: 1 2 3 4 5 6
counts: 461 224 397 211 237 470
* Out of bag error:
jbt overall 1 2 3 4 5 6
train: 10 14.09 3.36 5.01 7.80 47.95 21.91 18.11
test: 10 10.35 1.08 1.34 6.05 36.49 14.35 13.62
train: 20 10.64 2.71 3.76 4.27 42.41 14.47 13.49
test: 20 9.35 0.65 2.23 6.05 35.55 10.55 11.70
train: 30 10.30 2.99 3.13 4.37 43.86 13.40 11.85
test: 30 9.10 0.43 2.23 6.05 35.55 10.55 10.85
train: 40 10.26 3.08 3.13 3.95 42.41 13.62 12.43
test: 40 9.20 0.65 3.13 6.05 35.55 10.55 10.64
train: 50 9.88 3.17 3.13 3.85 42.65 13.40 10.79
test: 50 9.20 0.87 3.13 6.05 35.55 10.55 10.43
* Summary output:
final error rate: 9.88%
final error test: 9.20%
Training set confusion matrix (OOB):
true class
1 2 3 4 5 6
1 1038 1 3 7 25 0
2 2 464 1 3 5 2
3 21 0 924 87 0 22
4 0 5 20 238 3 69
5 10 5 1 4 407 19
6 1 4 12 76 30 926
Test set confusion matrix:
true class
1 2 3 4 5 6
1 457 0 3 0 5 0
2 0 217 1 1 3 0
3 2 1 373 32 1 11
4 0 1 13 136 1 30
5 2 3 1 3 212 8
6 0 2 6 39 15 421
* RF all done!!!
4. Summary
The advantages of the random forest method are:
(1) it performs well on many data sets and holds a considerable edge over other algorithms;
(2) it is easy to parallelize, a major advantage on large data sets;
(3) it can handle high-dimensional data without explicit feature selection.
Random forests earn a place in most general machine learning courses. Readers interested in the algorithm are encouraged to study Lecture 5 of 《机器学习20讲》 in full. The source code is packaged in the shared resources (it calls a precompiled library that only runs under 32-bit MATLAB, so I specifically installed a 32-bit MATLAB to get this example working); feel free to use it.
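For readers without a 32-bit MATLAB, a comparable experiment can be run with the built-in TreeBagger class from the Statistics and Machine Learning Toolbox. The sketch below is an illustrative substitute for the packaged RF library, not the book's code; it reuses x, cl, xts and clts from Section 2 (TreeBagger expects one sample per row, hence the transposes), and the parameter choices mirror the runParam settings above.
rng(4351);                                     % same seed as above, for repeatability
model = TreeBagger(50, double(x'), double(cl), ...
    'Method', 'classification', ...
    'NumPredictorsToSample', 6, ...            % counterpart of mtry = 6
    'OOBPrediction', 'on');
oobErr = oobError(model);                      % OOB error after 1..50 trees
yhat = str2double(predict(model, double(xts')));
testErr = mean(yhat ~= double(clts)) * 100;
fprintf('OOB error: %.2f%%, test error: %.2f%%\n', oobErr(end) * 100, testErr);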
The full list of articles in this series:
视觉机器学习20讲-MATLAB源码示例(1)-Kmeans聚类算法
视觉机器学习20讲-MATLAB源码示例(2)-KNN学习算法
视觉机器学习20讲-MATLAB源码示例(3)-回归学习算法
视觉机器学习20讲-MATLAB源码示例(4)-决策树学习算法
视觉机器学习20讲-MATLAB源码示例(5)-随机森林(Random Forest)学习算法
视觉机器学习20讲-MATLAB源码示例(6)-贝叶斯学习算法
视觉机器学习20讲-MATLAB源码示例(7)-EM算法
视觉机器学习20讲-MATLAB源码示例(8)-Adaboost算法
视觉机器学习20讲-MATLAB源码示例(9)-SVM算法
视觉机器学习20讲-MATLAB源码示例(10)-增强学习算法
视觉机器学习20讲-MATLAB源码示例(11)-流形学习算法
视觉机器学习20讲-MATLAB源码示例(12)-RBF学习算法
视觉机器学习20讲-MATLAB源码示例(13)-稀疏表示算法
视觉机器学习20讲-MATLAB源码示例(14)-字典学习算法
视觉机器学习20讲-MATLAB源码示例(15)-BP学习算法
视觉机器学习20讲-MATLAB源码示例(16)-CNN学习算法
视觉机器学习20讲-MATLAB源码示例(17)-RBM学习算法
视觉机器学习20讲-MATLAB源码示例(18)-深度学习算法
视觉机器学习20讲-MATLAB源码示例(19)-遗传算法
视觉机器学习20讲-MATLAB源码示例(20)-蚁群算法
|