Monday, 2 February 2015

Prediction of Registered Bike Sharing Users - Data Science Case

Bike Sharing Demand Prediction - Data Analytics Case Study
You must have seen Bike kiosks at the Bangalore Metro station in Whitefield. Bike sharing systems are a new generation of traditional renting bikes and returning them back, at the same or a different location, through an automatic system. Today, there are more than 500 bike sharing programs around the world due to their important role in traffic management, environment and health issues.

Bike riding and sharing is highly dependent on various parameters including weather conditions, seasons, holidays, etc and can affect rental behaviour. The purpose of this case is to understand the rental behaviour of one such bike sharing system from the data set related to two-year historical log corresponding to the years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data.

The dataset can be downloaded from:
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

The dataset is aggregated on a two hourly and daily basis and then extracted and added to the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.

Task:
Predication of bike rental count hourly or daily based on environmental and seasonal settings.

Dataset characteristics
- instant: record index

- dteday : date

- season : season (1:springer, 2:summer, 3:fall, 4:winter)

- yr : year (0: 2011, 1:2012)

- mnth : month ( 1 to 12)

- hr : hour (0 to 23)

- holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

- weekday : day of the week

- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

weathersit :

- 1: Clear, Few clouds, Partly cloudy, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided to 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)

- hum: Normalized humidity. The values are divided to 100 (max)

- windspeed: Normalized wind speed. The values are divided to 67 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered



How to Analyse the Data – Data Science Process
Following are the steps:

1. Develop an understanding and purpose of the project -  Prediction of bike rentals.

2. Obtain the dataset - The dataset can be downloaded from UCI public archive:
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

3. Pre-process the data - Data cleaning

5. Explore the data: Understand the data.

- what variables to keep for the analysis
- how the variables are related to each other
- are there obvious outliers
- are there any missing values

6. Partition the data. Divide the data into training and test data (if we do not have the same separately).

7. Choose the proper technique: Regression, Classification, etc

8. Use the tool and algorithm
- SPSS, SAS or R.

9. Interpret the results.
How well your model has performed. Understand the errors. Test and compare with other models, if required. Validate the results with test dataset.

10. Deploy the model.
Integrate the model with real time systems.

DATA PROCESSING

Read the data from CSV file (dataset)

> bike_train<-read.csv("train.csv",stringsAsFactors=FALSE,header=TRUE)
> str(bike_train)
'data.frame': 10886 obs. of  14 variables:
 $ datetime  : POSIXlt, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ date      : chr  "01-01-01" "01-01-01" "01-01-01" "01-01-01" ...
 $ time      : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
> 
Find the total number of records in the dataset.

Total number of records is 108866

> nrow(bike_train)
[1] 10886






Next step is to convert 'datetime' attribute to DATE format.

> bike_train$datetime <- strptime(bike_train$datetime,"%Y-%m-%d %H:%M:%S")
> bike_train$datetime[1:10]
 [1] "2011-01-01 00:00:00" "2011-01-01 01:00:00" "2011-01-01 02:00:00"
 [4] "2011-01-01 03:00:00" "2011-01-01 04:00:00" "2011-01-01 05:00:00"
 [7] "2011-01-01 06:00:00" "2011-01-01 07:00:00" "2011-01-01 08:00:00"
[10] "2011-01-01 09:00:00"

> class(bike_train$datetime)
[1] "POSIXlt" "POSIXt"
> 

For the analysis, we will split the 'datetime' into two separate columns - date and time

> bike_train$date <- "0"
> bike_train$time <- "0"
> bike_train$date<-format(bike_train$datetime,"%d-%m-%d")
> bike_train$time<-format(bike_train$datetime,"%H:%M:%S")





Next step is to check for NAs/Missing Values. You can use any of the following methods:

>  sapply(bike_train,function(x) sum(is.na(x)))
>  apply(is.na(bike_train),2,sum)
>  colSums(is.na(bike_train))
datetime     season    holiday workingday    weather       temp      atemp   humidity
         0          0          0          0          0          0          0          0
 windspeed     casual registered      count       date       time
         0          0          0          0          0          0

In this dataset, there are no NAs/Missing Values.



Data Exploration

There are 12 variables in the training set. The ‘Count’ variable represents the total rental of casual and registered users. You can predict the total number of users or predict casual users and registered users separately. But, the business goal is to increase the registered users. The outcome variables are - count, casual ad registered.

The predictor is a count/'number' - registered number of users. Hence, we can simply use regression analysis. Multiple regression analysis, simply referred to as linear regression, examines the effect of multiple independent variables on the value of dependent variable, or outcome. In this "Bike Rental" case, outcome variable or dependent variable is the registered number of users. All other variables are independent variables. We have to select those independent variables which have impact on the outcome variable. Multiple regression calculates a coefficient for each independent variable and also its statistical significance. Statistical significance estimates the significance of that variable, the effect of each predictor on the dependent variable, with other predictors held constant. The technique determines the linear relationship; therefore assumption of normality, linearity, and equal variance are carefully observed for each of the independent variable. The beta coefficients are the marginal impacts of each variable, and the size of the weight can be interpreted directly.

Let us first look at these four variables - temp, atemp, humidity, windspeed. Based on the description, temp and atemp seems to be the same except for the measurement scale. So, we can just keep one of the variables, either temp or atemp. Humidity, windspeed, temperature may certainly have impact on bike ride.

Plotting scatter plot of temp vs atemp:

> plot(bike_train$temp,bike_train$atemp)

















How the other variables - holiday, season, working day and time have an impact needs to be explored.

Let us take only three variables at first - humidity, temperature and wind-speed and find out the prediction estimates first. We will examine the other variables  later.

With,  y = temp + windspeed + humidity model resulted with R-squared:  0.162,
which is very low.

How can we improve the model? We have to consider other variables in the dataset:
1. Weekday
2. Season
3. Time
4. Holiday




Prediction – Model 1
We can include temp, humidity and windspeed variables in our linear regression model:

>  lm.fit<-lm(registered ~ temp + humidity + windspeed,data=bike_train)
>  summary(lm.fit)


Call:
lm(formula = registered ~ temp + humidity + windspeed, data = bike_train)

Residuals:
    Min      1Q  Median      3Q     Max
-249.59  -88.21  -33.88   47.33  691.72

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 148.01546    6.93486  21.344   <2e-16 ***
temp          5.88117    0.17061  34.471   <2e-16 ***
humidity     -1.87816    0.07286 -25.778   <2e-16 ***
windspeed     0.37409    0.17141   2.182   0.0291 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 138.3 on 10882 degrees of freedom
Multiple R-squared:  0.162, Adjusted R-squared:  0.1618
F-statistic: 701.3 on 3 and 10882 DF,  p-value: < 2.2e-16

> coefficients(lm.fit)
(Intercept)        temp    humidity   windspeed 
148.0154565   5.8811665  -1.8781570   0.3740919 
> anova(lm.fit)
Analysis of Variance Table

Response: registered
             Df    Sum Sq  Mean Sq   F value  Pr(>F)    
temp          1  25201133 25201133 1317.9165 < 2e-16 ***
humidity      1  14939932 14939932  781.2975 < 2e-16 ***
windspeed     1     91083    91083    4.7633 0.02909 *  
Residuals 10882 208085065    19122                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vcov(lm.fit)
            (Intercept)          temp      humidity    windspeed
(Intercept)  48.0923075 -0.6614646277 -0.3984977683 -0.647853238
temp         -0.6614646  0.0291077912  0.0009264737  0.001191682
humidity     -0.3984978  0.0009264737  0.0053085096  0.004002497
windspeed    -0.6478532  0.0011916820  0.0040024971  0.029379834
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observerations.

layout(matrix(c(1,2,3,4),2,2))
plot(lm.fit)















Prediction Model 2

The regression model with only 3 variables -temp, windspeed and humidity, yielded a predictive power with R-squared value of only 0.16. We have to consider other variables such as season, workingday, holiday, date, and time. It looks like we have to engineer some of the features ourselves.

Workingday variable provides information only about whether that day is a working day or not? 0 indicates it is not a working day and 1 indicates it is a working day.

> table(bike_train$workingday)

   0    1
3474 7412

holiday variable provides how many days were national holidays? From the data, 311 days are national holidays. http://dchr.dc.gov/page/holiday-schedules

> table(bike_train$holiday)

    0     1
10575   311

Further, datetime variable contains both date and time. For the sake of understanding the data, we will divide the datetime into two variables - date and time. And also, let’s add another variable to include the day of the week - Monday, Tuesday, etc.

> bike_train$date <- "0"
> bike_train$time <- "0"
> bike_train$date<-format(bike_train$datetime,"%Y-%m-%d")
> bike_train$time<-format(bike_train$datetime,"%H:%M:%S")

> bike_train$weekday <- as.POSIXlt(bike_train$date)$wday
> str(bike_train)
data.frame': 10886 obs. of  15 variables:
 $ datetime  : POSIXlt, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
.....
....
$ date      : Date, format: "2011-01-01" "2011-01-01" ...
$ time      : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...


Plot the bike sharing activities. From the plot, it can be seen that sharing activities has increased in the year 2012 (both registered and casual).

bike_train$date<-as.Date(bike_train$date)
plot(bike_train$date,bike_train$registered,col="blue")
plot(bike_train$date,bike_train$casual,col="blue")





Add another column with only YEAR
> bike_train$year<-as.integer(format(bike_train$date,"%Y"))


> bike_train$year<- as.factor(bike_train$year)

Following plot shows the weekdays activities and seasonal activities.
It is clear that the bike sharing activities are more during the weekdays (Monday - Friday) than over the holidays and weekends. Season 4 has slightly more activities. Also, there are fewer activities during holidays.

Let's further look at hourly activity. Again, 'datetime' variable should be split into hours. Thanks to 'R', we can do the same using the following functions:


###Breaking down Hours
> bike_train$hour<- as.integer(substr(bike_train$time,1,2))
> bike_train$hour<- as.factor(bike_train$hour)

Plot hourly sharing activity:

> plot(bike_train$hour,bike_train$registered,col="red")





As you can see from the plot, the maximum activity is between 7AM and 9AM  and in the evenings between 5PM to 7PM.

Based on the above analysis, we can include season, weekday, year, and hour, variables in our linear equation and rebuild our model once again.

>  lm.fit<-lm(registered ~ temp + humidity + windspeed + weekday + season + year + hour, data=bike_train)
>  summary(lm.fit)


Call:
lm(formula = registered ~ temp + humidity + windspeed + season + 
    weekday + year + hour, data = bike_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-354.92  -44.62   -1.58   42.24  447.05 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -56.02979    6.41291  -8.737  < 2e-16 ***
temp          3.41618    0.11599  29.453  < 2e-16 ***
humidity     -0.73155    0.05218 -14.019  < 2e-16 ***
windspeed    -0.48243    0.11056  -4.363 1.29e-05 ***
season       18.70139    0.80233  23.309  < 2e-16 ***
weekday       2.08909    0.42015   4.972 6.72e-07 ***
year2012     75.41778    1.69425  44.514  < 2e-16 ***
hour1       -15.30919    5.82010  -2.630  0.00854 **
hour2       -24.07787    5.84085  -4.122 3.78e-05 ***
hour3       -32.86276    5.89405  -5.576 2.53e-08 ***
hour4       -33.95585    5.86720  -5.787 7.35e-09 ***
hour5       -18.60912    5.83458  -3.189  0.00143 **
hour6        36.67045    5.82669   6.294 3.22e-10 ***
hour7       164.73021    5.82224  28.293  < 2e-16 ***
hour8       299.88648    5.81858  51.539  < 2e-16 ***
hour9       143.97430    5.82201  24.729  < 2e-16 ***
hour10       75.81952    5.83540  12.993  < 2e-16 ***
hour11       91.18850    5.85794  15.567  < 2e-16 ***
hour12      123.54632    5.88130  21.007  < 2e-16 ***
hour13      115.47966    5.90946  19.541  < 2e-16 ***
hour14       96.08779    5.93029  16.203  < 2e-16 ***
hour15      106.88257    5.93492  18.009  < 2e-16 ***
hour16      170.89275    5.93159  28.811  < 2e-16 ***
hour17      325.76987    5.91085  55.114  < 2e-16 ***
hour18      305.13681    5.88780  51.825  < 2e-16 ***
hour19      206.67681    5.85621  35.292  < 2e-16 ***
hour20      135.93863    5.83632  23.292  < 2e-16 ***
hour21       92.50632    5.82264  15.887  < 2e-16 ***
hour22       61.70652    5.81716  10.608  < 2e-16 ***
hour23       27.37475    5.81367   4.709 2.52e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 87.73 on 10856 degrees of freedom
Multiple R-squared:  0.6635, Adjusted R-squared:  0.6626
F-statistic: 738.3 on 29 and 10856 DF,  p-value: < 2.2e-16

Residual plots:




New model R-squared is now 0.6635. This is much better than our first model (Registered users).

Similar equation can be modelled for casual users as well.

Call:
lm(formula = casual ~ temp + humidity + windspeed + season +
    weekday + year + hour, data = bike_train)

Residual standard error: 36.57 on 10856 degrees of freedom
Multiple R-squared:  0.4657, Adjusted R-squared:  0.4643 
F-statistic: 326.3 on 29 and 10856 DF,  p-value: < 2.2e-16