Caret R Package - classification and regression training
(http://topepo.github.io/caret/index.html)
The caret package (short for classification and regression training) is a set of functions that streamline the process of creating predictive models for complex regression and classification problems. The package contains tools for:
- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation
Following are the steps to install the caret package, which has many dependencies. Install "minqa", "RcppEigen", "scales", "lme4", "ggplot2", "reshape2" and "BradleyTerry2" one by one, then install caret itself. Use straight quotes in the commands; R does not parse curly quotes.
Step 1: install.packages("minqa")
Step 2: install.packages("RcppEigen")
Step 3: install.packages("scales")
Step 4: install.packages("lme4")
Step 5: install.packages("ggplot2")
Step 6: install.packages("reshape2")
Step 7: install.packages("BradleyTerry2")
Step 8: install.packages("caret", dependencies = c("Depends", "Suggests"))
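Before installing, it can help to see which of the dependencies are already present. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of caret, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Dependencies named in the installation steps.
pkgs <- c("minqa", "RcppEigen", "scales", "lme4",
          "ggplot2", "reshape2", "BradleyTerry2")

# Hypothetical helper: return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

If to_install comes back empty, only the final caret installation step is needed.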
Example of predicting using the "glm" method:
library(caret)
library(kernlab)   # provides the spam dataset
data(spam)
# partition the data: 75% training, 25% testing
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
> dim(training)
[1] 3451 58
> dim(testing)
[1] 1150 58
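A split of this kind is usually stratified, meaning each class contributes about the same share of its rows to the training set. The sketch below illustrates the idea in base R on the built-in iris data. It is not caret's own implementation: the helper stratified_split is hypothetical, and rounding the per-class count up is an assumption made for the illustration.

```r
# Hypothetical helper, not part of caret: pick a share p of each class.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Assumption: round the per-class sample size up.
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)
in_train  <- stratified_split(iris$Species, p = 0.75)
train_set <- iris[in_train, ]
test_set  <- iris[-in_train, ]

dim(train_set)
table(train_set$Species)  # every class contributes the same share
```

Sampling within each class keeps the class proportions of the training and testing sets close to those of the full data.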
> set.seed(1234)
> fit<-train(type~., data=training, method="glm")
Loading required namespace: e1071
There were 26 warnings (use warnings() to see them)
> fit
Generalized Linear Model
3451 samples
57 predictor
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
Resampling results
  Accuracy   Kappa      Accuracy SD  Kappa SD
  0.9207482  0.8330317  0.008444636  0.01755059
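The "Bootstrapped (25 reps)" line means the accuracy was estimated by refitting the model on 25 samples drawn with replacement, each the same size as the training set. The following base-R sketch shows the general idea only. It is not caret's internal code, and the mtcars data and the formula am ~ wt are stand-ins chosen so the example runs without extra packages.

```r
# Base-R sketch of bootstrap resampling (an illustration, not caret's code).
# Assumption: mtcars and the formula am ~ wt are stand-ins.
set.seed(1234)
n_reps <- 25
boot_accuracy <- numeric(n_reps)

for (i in seq_len(n_reps)) {
  # Draw n rows with replacement; rows never drawn form the hold-out set.
  in_boot  <- sample(nrow(mtcars), replace = TRUE)
  held_out <- mtcars[-unique(in_boot), ]

  # Fit a logistic regression on the bootstrap sample.
  model <- suppressWarnings(
    glm(am ~ wt, data = mtcars[in_boot, ], family = binomial)
  )

  # Classify the held-out rows with a 0.5 probability cutoff.
  prob <- predict(model, newdata = held_out, type = "response")
  boot_accuracy[i] <- mean((prob > 0.5) == (held_out$am == 1))
}

mean(boot_accuracy)  # resampled accuracy estimate
sd(boot_accuracy)    # spread across the 25 replicates
```

The mean and standard deviation over the replicates correspond in spirit to the Accuracy and Accuracy SD columns in the output above.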
> fit$finalModel
Call: NULL
Coefficients:
(Intercept) make
-1.515e+00 -3.393e-01
address all
-1.482e-01 9.183e-02
num3d our
2.531e+00 5.661e-01
over remove
4.999e-01 2.612e+00
internet order
5.661e-01 8.957e-01
mail receive
9.189e-02 -2.957e-01
will people
-1.321e-01 -2.583e-01
report addresses
1.068e-01 1.121e+00
free business
9.468e-01 1.080e+00
email you
1.910e-02 8.164e-02
credit your
1.387e+00 2.326e-01
font num000
3.465e-01 3.525e+00
money hp
1.376e+00 -1.982e+00
hpl george
-1.369e+00 -9.258e+00
num650 lab
9.965e-01 -2.143e+00
labs telnet
-6.141e-01 -1.234e-01
num857 data
2.369e+00 -9.245e-01
num415 num85
1.111e+00 -2.231e+00
technology num1999
7.566e-01 8.572e-02
parts pm
-5.501e-01 -1.005e+00
direct cs
-2.563e-01 -4.692e+01
meeting original
-2.173e+00 -9.787e-01
project re
-1.610e+00 -7.536e-01
edu table
-1.483e+00 -3.167e+00
conference charSemicolon
-4.491e+00 -1.623e+00
charRoundbracket charSquarebracket
1.356e-01 -6.342e-01
charExclamation charDollar
2.497e-01 5.745e+00
charHash capitalAve
2.223e+00 -1.661e-03
capitalLong capitalTotal
8.800e-03 7.263e-04
Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
Null Deviance: 4628
Residual Deviance: 1297 AIC: 1413
PREDICTIONS:
> predictions<- predict(fit, newdata=testing)
> predictions
[1] spam spam spam spam spam
[6] spam nonspam spam spam spam
[11] spam spam spam spam spam
[16] spam nonspam spam spam spam
[21] spam spam spam spam spam
[26] nonspam spam spam spam spam
[31] nonspam spam spam spam spam
[36] spam spam spam spam spam
[41] spam spam spam spam spam
[46] spam spam spam spam spam
[51] spam spam spam nonspam spam
[56] spam spam spam spam spam
[61] spam spam spam spam spam
[66] spam spam spam spam spam
[71] spam spam spam spam nonspam
> confusionMatrix(predictions,testing$type)
Confusion Matrix and Statistics
Reference
Prediction nonspam spam
nonspam 659 50
spam 38 403
Accuracy : 0.9235
95% CI : (0.9066, 0.9382)
No Information Rate : 0.6061
P-Value [Acc > NIR] : <2e-16
Kappa : 0.839
Mcnemar's Test P-Value : 0.241
Sensitivity : 0.9455
Specificity : 0.8896
Pos Pred Value : 0.9295
Neg Pred Value : 0.9138
Prevalence : 0.6061
Detection Rate : 0.5730
Detection Prevalence : 0.6165
Balanced Accuracy : 0.9176
'Positive' Class : nonspam
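The main statistics in this output can be recomputed by hand from the four cell counts, which is a useful check on how each one is defined. The sketch below uses only base R and the counts shown in the table, with nonspam as the positive class; the variable names are chosen for the illustration.

```r
# Cell counts copied from the confusion matrix above.
cm <- matrix(c(659, 38, 50, 403), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))
n <- sum(cm)

# Share of all predictions that match the reference.
accuracy <- sum(diag(cm)) / n

# "nonspam" is the positive class.
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])
balanced_accuracy <- (sensitivity + specificity) / 2

# Kappa compares accuracy with the agreement expected by chance.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_stat,
        sensitivity = sensitivity, specificity = specificity,
        balanced_accuracy = balanced_accuracy), 4)
```

The rounded values agree with the Accuracy, Kappa, Sensitivity, Specificity and Balanced Accuracy lines reported by confusionMatrix.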