Wednesday, 11 March 2015

Caret R Package - classification and regression training

(http://topepo.github.io/caret/index.html)

The caret package (short for classification and regression training) is a set of functions that streamline the process of building predictive models for complex regression and classification problems. The package contains tools for:
  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation
The following steps install the caret package, which has many dependencies.

Install the packages minqa, RcppEigen, scales, lme4, ggplot2, reshape2 and BradleyTerry2 one by one (scales is normally pulled in automatically as a dependency of ggplot2), then install caret itself.

Step 1: install.packages("minqa")
Step 2: install.packages("RcppEigen")
Step 3: install.packages("lme4")
Step 4: install.packages("ggplot2")
Step 5: install.packages("reshape2")
Step 6: install.packages("BradleyTerry2")
Step 7: install.packages("caret", dependencies = c("Depends", "Suggests"))
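
The steps above can also be scripted so that only packages not yet present are installed. This is a minimal sketch, not an official caret procedure: missing_packages is a hypothetical helper introduced here, and the install call is guarded so it only runs in an interactive session.

```r
# Packages named in the steps above, plus caret itself.
pkgs <- c("minqa", "RcppEigen", "lme4", "ggplot2", "reshape2",
          "BradleyTerry2", "caret")

# Hypothetical helper (not part of caret): returns the names in `pkgs`
# that do not appear in the local package library.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)

# Only install from an interactive session, where a CRAN mirror can be chosen.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```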

Example of prediction using the "glm" method:
library(caret)
library(kernlab)
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)   # partition: 75% training, 25% testing
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
> dim(training)
[1] 3451   58
> dim(testing)
[1] 1150   58
>
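
The sizes 3451 and 1150 are consistent with a split that samples within each class separately. The base R sketch below illustrates that idea; it is not caret's actual code. It assumes that ceiling(p * n) rows are taken from each class, and it uses the class counts of the kernlab spam data (2788 nonspam, 1813 spam), which are not printed in this post. stratified_split is a hypothetical helper.

```r
# Hypothetical helper: draw a proportion p of the indices within each class of y.
stratified_split <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # sample.int avoids the surprise of sample() on a length-one vector
    idx[sample.int(length(idx), ceiling(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
# Stand-in for spam$type with the same class counts (assumed, see above).
y <- factor(rep(c("nonspam", "spam"), times = c(2788, 1813)))

in_train <- stratified_split(y, p = 0.75)
n_train <- length(in_train)       # 2091 + 1360 = 3451
n_test  <- length(y) - n_train    # 1150
print(c(training = n_train, testing = n_test))
```

Because the sampling is done per class, the training and testing sets keep roughly the same spam to nonspam ratio as the full data.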

> set.seed(1234)
> fit<-train(type~., data=training, method="glm")
Loading required namespace: e1071
There were 26 warnings (use warnings() to see them)
> fit
Generalized Linear Model

3451 samples
  57 predictor
   2 classes: 'nonspam', 'spam'

No pre-processing
Resampling: Bootstrapped (25 reps)

Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD 
  0.9207482  0.8330317  0.008444636  0.01755059


>
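
The line "Resampling: Bootstrapped (25 reps)" means the accuracy above is an average over 25 bootstrap resamples, each the same size as the training set. The base R sketch below shows the general idea on simulated data; it is an illustration under stated assumptions, not caret's internal implementation. The data, the single predictor x, and the 0.5 cutoff are all made up for the example.

```r
set.seed(1234)

# Simulated two-class data (hypothetical, stands in for the spam training set).
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(1.5 * x), "spam", "nonspam"))
dat <- data.frame(x = x, y = y)

# 25 bootstrap repetitions: fit on rows drawn with replacement,
# then score on the rows that were not drawn (the "out-of-bag" rows).
boot_accuracy <- vapply(seq_len(25), function(i) {
  in_bag <- sample.int(n, n, replace = TRUE)
  oob <- setdiff(seq_len(n), in_bag)
  fit <- glm(y ~ x, data = dat[in_bag, ], family = binomial)
  prob <- predict(fit, newdata = dat[oob, ], type = "response")
  pred <- ifelse(prob > 0.5, "spam", "nonspam")
  mean(pred == dat$y[oob])
}, numeric(1))

# Mean and spread across resamples, analogous to the Accuracy and Accuracy SD columns.
print(c(Accuracy = mean(boot_accuracy), AccuracySD = sd(boot_accuracy)))
```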
> fit$finalModel
 
Call:  NULL
 
Coefficients:
      (Intercept)               make  
       -1.515e+00         -3.393e-01  
          address                all  
       -1.482e-01          9.183e-02  
            num3d                our  
        2.531e+00          5.661e-01  
             over             remove  
        4.999e-01          2.612e+00  
         internet              order  
        5.661e-01          8.957e-01  
             mail            receive  
        9.189e-02         -2.957e-01  
             will             people  
       -1.321e-01         -2.583e-01  
           report          addresses  
        1.068e-01          1.121e+00  
             free           business  
        9.468e-01          1.080e+00  
            email                you  
        1.910e-02          8.164e-02  
           credit               your  
        1.387e+00          2.326e-01  
             font             num000  
        3.465e-01          3.525e+00  
            money                 hp  
        1.376e+00         -1.982e+00  
              hpl             george  
       -1.369e+00         -9.258e+00  
           num650                lab  
        9.965e-01         -2.143e+00  
             labs             telnet  
       -6.141e-01         -1.234e-01  
           num857               data  
        2.369e+00         -9.245e-01  
           num415              num85  
        1.111e+00         -2.231e+00  
       technology            num1999  
        7.566e-01          8.572e-02  
            parts                 pm  
       -5.501e-01         -1.005e+00  
           direct                 cs  
       -2.563e-01         -4.692e+01  
          meeting           original  
       -2.173e+00         -9.787e-01  
          project                 re  
       -1.610e+00         -7.536e-01  
              edu              table  
       -1.483e+00         -3.167e+00  
       conference      charSemicolon  
       -4.491e+00         -1.623e+00  
 charRoundbracket  charSquarebracket  
        1.356e-01         -6.342e-01  
  charExclamation         charDollar  
        2.497e-01          5.745e+00  
         charHash         capitalAve  
        2.223e+00         -1.661e-03  
      capitalLong       capitalTotal  
        8.800e-03          7.263e-04  
 
Degrees of Freedom: 3450 Total (i.e. Null);  3393 Residual
Null Deviance:     4628 
Residual Deviance: 1297        AIC: 1413
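
One way to read these coefficients, assuming the fitted model is a logistic regression on the log-odds of the second class ("spam"), is to exponentiate them into odds ratios. The sketch below does this for three values copied from the output above; the choice of these three predictors is only for illustration.

```r
# Coefficients copied from the fit$finalModel output above (log-odds scale).
coefs <- c(remove = 2.612, charDollar = 5.745, george = -9.258)

# exp() turns a log-odds coefficient into an odds ratio per unit increase.
odds_ratios <- exp(coefs)
print(signif(odds_ratios, 3))
```

An odds ratio above 1 (remove, charDollar) pushes a message towards "spam", while one below 1 (george) pushes it towards "nonspam".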

PREDICTIONS:

> predictions<- predict(fit, newdata=testing)
> predictions
   [1] spam    spam    spam    spam    spam   
   [6] spam    nonspam spam    spam    spam   
  [11] spam    spam    spam    spam    spam   
  [16] spam    nonspam spam    spam    spam   
  [21] spam    spam    spam    spam    spam   
  [26] nonspam spam    spam    spam    spam   
  [31] nonspam spam    spam    spam    spam   
  [36] spam    spam    spam    spam    spam   
  [41] spam    spam    spam    spam    spam   
  [46] spam    spam    spam    spam    spam   
  [51] spam    spam    spam    nonspam spam   
  [56] spam    spam    spam    spam    spam   
  [61] spam    spam    spam    spam    spam   
  [66] spam    spam    spam    spam    spam   
  [71] spam    spam    spam    spam    nonspam

> confusionMatrix(predictions,testing$type)
Confusion Matrix and Statistics

          Reference
Prediction nonspam spam
   nonspam     659   50
   spam         38  403
                                         
               Accuracy : 0.9235         
                 95% CI : (0.9066, 0.9382)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.839          
 Mcnemar's Test P-Value : 0.241          
                                         
            Sensitivity : 0.9455          
            Specificity : 0.8896         
         Pos Pred Value : 0.9295         
         Neg Pred Value : 0.9138         
             Prevalence : 0.6061         
         Detection Rate : 0.5730         
   Detection Prevalence : 0.6165          
      Balanced Accuracy : 0.9176         
                                         
       'Positive' Class : nonspam        
                                         
>
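
The main statistics reported by confusionMatrix can be reproduced by hand from the four cell counts. The base R sketch below uses the table printed above, with "nonspam" as the positive class; the variable names are chosen here for illustration.

```r
# Cell counts copied from the confusion matrix above (filled column by column).
cm <- matrix(c(659, 38, 50, 403), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                          # correct / total
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])  # true nonspam found
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])           # true spam found
ppv         <- cm["nonspam", "nonspam"] / sum(cm["nonspam", ])  # Pos Pred Value
npv         <- cm["spam", "spam"] / sum(cm["spam", ])           # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2                  # Balanced Accuracy

stats <- c(Accuracy = accuracy, Sensitivity = sensitivity,
           Specificity = specificity, PPV = ppv, NPV = npv,
           BalancedAccuracy = balanced)
print(round(stats, 4))
```

Rounded to four decimals, these give 0.9235, 0.9455, 0.8896, 0.9295, 0.9138 and 0.9176, matching the report above.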