Wednesday, 11 March 2015

Caret R Package - classification and regression training

The caret package (short for classification and regression training) is a set of functions that streamline the process of training predictive models for complex regression and classification problems. The package contains tools for:
  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation
The following steps install the caret package, which has many dependencies.

Install the dependencies minqa, RcppEigen, scales, lme4, ggplot2, reshape2 and BradleyTerry2 one by one. (scales has no step of its own because installing ggplot2 pulls it in as a dependency.)

Step 1: install.packages("minqa")
Step 2: install.packages("RcppEigen")
Step 3: install.packages("lme4")
Step 4: install.packages("ggplot2")
Step 5: install.packages("reshape2")
Step 6: install.packages("BradleyTerry2")
Step 7: install.packages("caret", dependencies = c("Depends", "Suggests"))
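
Before running the steps above, it can help to check which of the packages are already present. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of caret, and the install call is left commented out so the snippet has no side effects.

```r
# Return the packages from `pkgs` that are not installed in any library path.
# system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!installed]
}

deps <- c("minqa", "RcppEigen", "scales", "lme4",
          "ggplot2", "reshape2", "BradleyTerry2")
to_install <- missing_packages(deps)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Passing a vector to install.packages installs several packages in one call, so the one-by-one steps are only needed when a single package fails and has to be diagnosed on its own.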

Example of predicting with the "glm" method:
> library(caret)
> library(kernlab)   # provides the spam dataset
> data(spam)
> inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)   # 75% training, 25% testing
> training <- spam[inTrain, ]
> testing <- spam[-inTrain, ]
> dim(training)
[1] 3451   58
> dim(testing)
[1] 1150   58
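
The split above is stratified: the 75% is drawn within each class of the outcome, so training and testing keep roughly the same spam/nonspam mix. A rough base R sketch of that idea follows. It is an illustration under that assumption, not caret's actual implementation, and the function name stratified_split and the toy labels are hypothetical.

```r
# Sample ceiling(p * n_class) row indices within each class of y
stratified_split <- function(y, p = 0.75) {
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), ceiling(p * length(rows)))]
  }))
  sort(unname(idx))
}

set.seed(42)
y <- factor(rep(c("nonspam", "spam"), times = c(60, 40)))  # hypothetical labels
in_train <- stratified_split(y, p = 0.75)

table(y[in_train])    # 45 nonspam, 30 spam
table(y[-in_train])   # 15 nonspam, 10 spam
```

Both subsets keep the 60/40 class ratio of the toy labels, which is the property a plain random sample of rows does not guarantee.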

> set.seed(1234)
> fit<-train(type~., data=training, method="glm")
Loading required namespace: e1071
There were 26 warnings (use warnings() to see them)
> fit
Generalized Linear Model

3451 samples
  57 predictor
   2 classes: 'nonspam', 'spam'

No pre-processing
Resampling: Bootstrapped (25 reps)

Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD 
  0.9207482  0.8330317  0.008444636  0.01755059
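
The "Bootstrapped (25 reps)" line means the accuracy above is an average over 25 models, each fitted on a bootstrap sample of the training set. The sketch below illustrates that idea in base R on simulated data, scoring each model on the rows left out of its bootstrap sample. It assumes this out-of-bag scoring and is not caret's own code; the data frame toy and the function boot_accuracy are hypothetical.

```r
# Toy two-class data standing in for the spam training set
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n) > 0, "spam", "nonspam"))

boot_accuracy <- function(data, reps = 25) {
  sapply(seq_len(reps), function(i) {
    idx <- sample(nrow(data), replace = TRUE)      # bootstrap sample, same size as data
    oob <- data[-unique(idx), , drop = FALSE]      # rows never drawn: held-out set
    model <- glm(y ~ x, data = data[idx, , drop = FALSE], family = binomial)
    p <- predict(model, newdata = oob, type = "response")  # P(second level) = P(spam)
    pred <- ifelse(p > 0.5, levels(data$y)[2], levels(data$y)[1])
    mean(pred == oob$y)                            # accuracy on held-out rows
  })
}

acc <- boot_accuracy(toy)
mean(acc)   # plays the role of the "Accuracy" column
sd(acc)     # plays the role of the "Accuracy SD" column
```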

> fit$finalModel
Call:  NULL
      (Intercept)               make  
       -1.515e+00         -3.393e-01  
          address                all  
       -1.482e-01          9.183e-02  
            num3d                our  
        2.531e+00          5.661e-01  
             over             remove  
        4.999e-01          2.612e+00  
         internet              order  
        5.661e-01          8.957e-01  
             mail            receive  
        9.189e-02         -2.957e-01  
             will             people  
       -1.321e-01         -2.583e-01  
           report          addresses  
        1.068e-01          1.121e+00  
             free           business  
        9.468e-01          1.080e+00  
            email                you  
        1.910e-02          8.164e-02  
           credit               your  
        1.387e+00          2.326e-01  
             font             num000  
        3.465e-01          3.525e+00  
            money                 hp  
        1.376e+00         -1.982e+00  
              hpl             george  
       -1.369e+00         -9.258e+00  
           num650                lab  
        9.965e-01         -2.143e+00  
             labs             telnet  
       -6.141e-01         -1.234e-01  
           num857               data  
        2.369e+00         -9.245e-01  
           num415              num85  
        1.111e+00         -2.231e+00  
       technology            num1999  
        7.566e-01          8.572e-02  
            parts                 pm  
       -5.501e-01         -1.005e+00  
           direct                 cs  
       -2.563e-01         -4.692e+01  
          meeting           original  
       -2.173e+00         -9.787e-01  
          project                 re  
       -1.610e+00         -7.536e-01  
              edu              table  
       -1.483e+00         -3.167e+00  
       conference      charSemicolon  
       -4.491e+00         -1.623e+00  
 charRoundbracket  charSquarebracket  
        1.356e-01         -6.342e-01  
  charExclamation         charDollar  
        2.497e-01          5.745e+00  
         charHash         capitalAve  
        2.223e+00         -1.661e-03  
      capitalLong       capitalTotal  
        8.800e-03          7.263e-04  
Degrees of Freedom: 3450 Total (i.e. Null);  3393 Residual
Null Deviance:     4628 
Residual Deviance: 1297        AIC: 1413
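
The coefficients printed above are on the log-odds scale, so exponentiating them gives odds ratios. The short sketch below does this for a handful of values copied from the printout. It assumes the model predicts the probability of the second factor level ("spam"), which is how a binomial glm treats a two-level factor outcome.

```r
# A few coefficients copied from the fit$finalModel printout (log-odds scale)
coefs <- c(remove = 2.612, free = 0.9468, hp = -1.982, george = -9.258)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(signif(odds_ratios, 3))

# Values above 1 push a message towards "spam", values below 1 towards "nonspam"
names(odds_ratios)[odds_ratios > 1]
names(odds_ratios)[odds_ratios < 1]
```

On a fitted caret object the same transformation would be exp(coef(fit$finalModel)).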


> predictions<- predict(fit, newdata=testing)
> predictions
   [1] spam    spam    spam    spam    spam   
   [6] spam    nonspam spam    spam    spam   
  [11] spam    spam    spam    spam    spam   
  [16] spam    nonspam spam    spam    spam   
  [21] spam    spam    spam    spam    spam   
  [26] nonspam spam    spam    spam    spam   
  [31] nonspam spam    spam    spam    spam   
  [36] spam    spam    spam    spam    spam   
  [41] spam    spam    spam    spam    spam   
  [46] spam    spam    spam    spam    spam   
  [51] spam    spam    spam    nonspam spam   
  [56] spam    spam    spam    spam    spam   
  [61] spam    spam    spam    spam    spam   
  [66] spam    spam    spam    spam    spam   
  [71] spam    spam    spam    spam    nonspam

> confusionMatrix(predictions,testing$type)
Confusion Matrix and Statistics

Prediction nonspam spam
   nonspam     659   50
   spam         38  403
               Accuracy : 0.9235         
                 95% CI : (0.9066, 0.9382)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : <2e-16         
                  Kappa : 0.839          
 Mcnemar's Test P-Value : 0.241          
            Sensitivity : 0.9455          
            Specificity : 0.8896         
         Pos Pred Value : 0.9295         
         Neg Pred Value : 0.9138         
             Prevalence : 0.6061         
         Detection Rate : 0.5730         
   Detection Prevalence : 0.6165          
      Balanced Accuracy : 0.9176         
       'Positive' Class : nonspam        
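
The headline statistics in this report follow directly from the four cell counts. As a check, the base R sketch below recomputes some of them by hand from the table above, with "nonspam" as the positive class; the variable names are only illustrative.

```r
# Rows are predictions, columns are the true (reference) classes
cm <- matrix(c(659, 38, 50, 403), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                                # correct / total
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])  # true nonspam found
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])           # true spam found

# Cohen's kappa: agreement beyond what the row/column totals give by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)      # 0.9235
round(sensitivity, 4)   # 0.9455
round(specificity, 4)   # 0.8896
round(kappa_stat, 3)    # 0.839
```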