**Bike Sharing Demand Prediction - Data Analytics Case Study**

You must have seen Bike kiosks at the Bangalore
Metro station in Whitefield. Bike sharing systems are a new generation of
traditional renting bikes and returning them back, at the same or a different
location, through an automatic system. Today, there are more than 500 bike
sharing programs around the world due to their important role in traffic
management, environment and health issues.

Bike riding and sharing is highly dependent on various parameters including weather conditions, seasons, holidays, etc and can affect rental behaviour. The purpose of this case is to understand the rental behaviour of one such bike sharing system from the data set related to two-year historical log corresponding to the years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data.

The dataset can be downloaded from:

http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

The dataset is aggregated on a two hourly and daily basis and then extracted and added to the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.

Predication of bike rental count hourly or daily based on environmental and seasonal settings.

- instant: record index

- dteday : date

- season : season (1:springer, 2:summer, 3:fall, 4:winter)

- yr : year (0: 2011, 1:2012)

- mnth : month ( 1 to 12)

- hr : hour (0 to 23)

- holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

- weekday : day of the week

- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

weathersit :

- 1: Clear, Few clouds, Partly cloudy, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided to 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)

- hum: Normalized humidity. The values are divided to 100 (max)

- windspeed: Normalized wind speed. The values are divided to 67 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered

Bike riding and sharing is highly dependent on various parameters including weather conditions, seasons, holidays, etc and can affect rental behaviour. The purpose of this case is to understand the rental behaviour of one such bike sharing system from the data set related to two-year historical log corresponding to the years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data.

The dataset can be downloaded from:

http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

The dataset is aggregated on a two hourly and daily basis and then extracted and added to the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.

**Task:**Predication of bike rental count hourly or daily based on environmental and seasonal settings.

**Dataset characteristics**- instant: record index

- dteday : date

- season : season (1:springer, 2:summer, 3:fall, 4:winter)

- yr : year (0: 2011, 1:2012)

- mnth : month ( 1 to 12)

- hr : hour (0 to 23)

- holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

- weekday : day of the week

- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

weathersit :

- 1: Clear, Few clouds, Partly cloudy, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided to 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)

- hum: Normalized humidity. The values are divided to 100 (max)

- windspeed: Normalized wind speed. The values are divided to 67 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered

**How to Analyse the Data – Data Science Process**

Following are the steps:

1. Develop an understanding and purpose of the
project - Prediction of bike rentals.

2. Obtain the dataset - The dataset can be
downloaded from UCI public archive:

http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

3. Pre-process the data - Data cleaning

5. Explore the data: Understand the data.

- what variables to keep for the analysis

- how the variables are related to each other

- are there obvious outliers

- are there any missing values

6. Partition the data. Divide the data into
training and test data (if we do not have the same separately).

7. Choose the proper technique: Regression,
Classification, etc

8. Use the tool and algorithm

- SPSS, SAS or R.

9. Interpret the results.

How well your model has performed. Understand
the errors. Test and compare with other models, if required. Validate the
results with test dataset.

10. Deploy the model.

Integrate the model with real time systems.

**DATA PROCESSING**

Read the data from CSV file
(dataset)

>
bike_train<-read.csv("train.csv",stringsAsFactors=FALSE,header=TRUE)

> str(bike_train)

'data.frame': 10886 obs. of 14
variables:

$ datetime : POSIXlt,
format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...

$ season : int
1 1 1 1 1 1 1 1 1 1 ...

$ holiday : int 0
0 0 0 0 0 0 0 0 0 ...

$ workingday: int 0 0 0
0 0 0 0 0 0 0 ...

$ weather : int 1
1 1 1 1 2 1 1 1 1 ...

$ temp :
num 9.84 9.02 9.02 9.84 9.84 ...

$ atemp : num
14.4 13.6 13.6 14.4 14.4 ...

$ humidity : int
81 80 80 75 75 75 80 86 75 76 ...

$ windspeed : num 0 0 0
0 0 ...

$ casual : int
3 8 5 3 0 0 2 1 1 8 ...

$ registered: int 13 32
27 10 1 1 0 2 7 6 ...

$ count : int
16 40 32 13 1 1 2 3 8 14 ...

$ date :
chr "01-01-01" "01-01-01" "01-01-01"
"01-01-01" ...

$ time :
chr "00:00:00" "01:00:00" "02:00:00"
"03:00:00" ...

>

Find the total number of records in
the dataset.

Total number of records is 108866

> nrow(bike_train)

[1] 10886

Next step is to convert 'datetime'
attribute to DATE format.

> bike_train$datetime <-
strptime(bike_train$datetime,"%Y-%m-%d %H:%M:%S")

> bike_train$datetime[1:10]

[1] "2011-01-01
00:00:00" "2011-01-01 01:00:00" "2011-01-01 02:00:00"

[4] "2011-01-01
03:00:00" "2011-01-01 04:00:00" "2011-01-01 05:00:00"

[7] "2011-01-01
06:00:00" "2011-01-01 07:00:00" "2011-01-01 08:00:00"

[10] "2011-01-01 09:00:00"

> class(bike_train$datetime)

[1] "POSIXlt"
"POSIXt"

>

For the analysis, we will split the
'datetime' into two separate columns - date and time

> bike_train$date <-
"0"

> bike_train$time <-
"0"

>
bike_train$date<-format(bike_train$datetime,"%d-%m-%d")

>
bike_train$time<-format(bike_train$datetime,"%H:%M:%S")

Next step is to check for
NAs/Missing Values. You can use any of the following methods:

>
sapply(bike_train,function(x) sum(is.na(x)))

>
apply(is.na(bike_train),2,sum)

>
colSums(is.na(bike_train))

datetime season
holiday workingday weather temp
atemp humidity

0
0 0
0 0
0 0
0

windspeed casual
registered count date
time

0
0 0
0 0
0

In this dataset, there are no
NAs/Missing Values.

**Data Exploration**

There are 12 variables in the training set. The ‘Count’ variable
represents the total rental of casual and registered users. You can predict the
total number of users or predict casual users and registered users separately.
But, the business goal is to increase the registered users. The outcome
variables are - count, casual ad registered.

The predictor is a count/'number' - registered number of users.
Hence, we can simply use regression analysis. Multiple regression analysis,
simply referred to as linear regression, examines the effect of multiple
independent variables on the value of dependent variable, or outcome. In this
"Bike Rental" case, outcome variable or dependent variable is the registered
number of users. All other variables are independent variables. We have to
select those independent variables which have impact on the outcome variable.
Multiple regression calculates a coefficient for each independent variable and
also its statistical significance. Statistical significance estimates the significance
of that variable, the effect of each predictor on the dependent variable, with
other predictors held constant. The technique determines the linear
relationship; therefore assumption of normality, linearity, and equal variance
are carefully observed for each of the independent variable. The beta
coefficients are the marginal impacts of each variable, and the size of the
weight can be interpreted directly.

Let us first look at these four variables - temp, atemp, humidity,
windspeed. Based on the description, temp and atemp seems to be the same except
for the measurement scale. So, we can just keep one of the variables, either
temp or atemp. Humidity, windspeed, temperature may certainly have impact on
bike ride.

Plotting scatter plot of temp vs
atemp:

>
plot(bike_train$temp,bike_train$atemp)

How the other variables - holiday, season, working day and time
have an impact needs to be explored.

Let us take only three variables at first - humidity, temperature
and wind-speed and find out the prediction estimates first. We will examine the
other variables later.

With, y = temp + windspeed + humidity model resulted with
R-squared: 0.162,

which is very low.

How can we improve the model? We have to consider other variables
in the dataset:

1. Weekday

2. Season

3. Time

4. Holiday

**Prediction – Model 1**

We can include temp, humidity and
windspeed variables in our linear regression model:

>
lm.fit<-lm(registered ~ temp + humidity + windspeed,data=bike_train)

>
summary(lm.fit)

Call:

lm(formula =
registered ~ temp + humidity + windspeed, data = bike_train)

Residuals:

Min
1Q Median 3Q Max

-249.59 -88.21
-33.88 47.33 691.72

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 148.01546
6.93486 21.344 <2e-16 ***

temp
5.88117 0.17061 34.471 <2e-16
***

humidity
-1.87816 0.07286 -25.778 <2e-16 ***

windspeed
0.37409 0.17141 2.182 0.0291 *

---

Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard
error: 138.3 on 10882 degrees of freedom

Multiple

**R-squared: 0.162**, Adjusted**R-squared: 0.1618**
F-statistic: 701.3 on
3 and 10882 DF, p-value: < 2.2e-16

>
coefficients(lm.fit)

(Intercept)
temp humidity windspeed

148.0154565
5.8811665 -1.8781570 0.3740919

>

> anova(lm.fit)

Analysis of Variance
Table

Response: registered

Df Sum Sq Mean Sq F value
Pr(>F)

temp
1 25201133 25201133 1317.9165 < 2e-16 ***

humidity
1 14939932 14939932 781.2975 < 2e-16 ***

windspeed
1 91083 91083 4.7633 0.02909 *

Residuals 10882
208085065 19122

---

Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

>

> vcov(lm.fit)

(Intercept) temp
humidity windspeed

(Intercept)
48.0923075 -0.6614646277 -0.3984977683 -0.647853238

temp
-0.6614646 0.0291077912 0.0009264737
0.001191682

humidity
-0.3984978 0.0009264737 0.0053085096 0.004002497

windspeed
-0.6478532 0.0011916820 0.0040024971 0.029379834

>

Diagnostic plots provide checks for
heteroscedasticity, normality, and influential observerations.

layout(matrix(c(1,2,3,4),2,2))

plot(lm.fit)

**Prediction Model 2**

The regression model with only 3
variables -temp, windspeed and humidity, yielded a predictive power with
R-squared value of only 0.16. We have to consider other variables such as
season, workingday, holiday, date, and time. It looks like we have to engineer
some of the features ourselves.

Workingday variable provides
information only about whether that day is a working day or not? 0 indicates it
is not a working day and 1 indicates it is a working day.

>
table(bike_train$workingday)

0
1

3474 7412

holiday variable provides how many days
were national holidays? From the data, 311 days are national holidays. http://dchr.dc.gov/page/holiday-schedules

>
table(bike_train$holiday)

0
1

10575 311

Further, datetime variable contains
both date and time. For the sake of understanding the data, we will divide the
datetime into two variables - date and time. And also, let’s add another
variable to include the day of the week - Monday, Tuesday, etc.

> bike_train$date
<- "0"

> bike_train$time
<- "0"

>
bike_train$date<-format(bike_train$datetime,"%Y-%m-%d")

>
bike_train$time<-format(bike_train$datetime,"%H:%M:%S")

>
bike_train$weekday <- as.POSIXlt(bike_train$date)$wday

> str(bike_train)

data.frame': 10886
obs. of 15 variables:

$ datetime
: POSIXlt, format: "2011-01-01 00:00:00" "2011-01-01
01:00:00" ...

$ season
: int 1 1 1 1 1 1 1 1 1 1 ...

.....

....

$ date
: Date, format: "2011-01-01" "2011-01-01" ...

$ time
: chr "00:00:00" "01:00:00"
"02:00:00" "03:00:00" ...

Plot the bike sharing activities. From
the plot, it can be seen that sharing activities has increased in the year 2012
(both registered and casual).

bike_train$date<-as.Date(bike_train$date)

plot(bike_train$date,bike_train$registered,col="blue")

plot(bike_train$date,bike_train$casual,col="blue")

Add another column
with only YEAR

>
bike_train$year<-as.integer(format(bike_train$date,"%Y"))

>
bike_train$year<- as.factor(bike_train$year)

Following plot shows the weekdays
activities and seasonal activities.

It is clear that the bike sharing
activities are more during the weekdays (Monday - Friday) than over the
holidays and weekends. Season 4 has slightly more activities. Also, there are fewer
activities during holidays.

Let's further look at hourly activity.
Again, 'datetime' variable should be split into hours. Thanks to 'R', we can do
the same using the following functions:

###Breaking down
Hours

>
bike_train$hour<- as.integer(substr(bike_train$time,1,2))

>
bike_train$hour<- as.factor(bike_train$hour)

Plot hourly sharing
activity:

>
plot(bike_train$hour,bike_train$registered,col="red")

As you can see from the plot, the maximum
activity is between 7AM and 9AM and in
the evenings between 5PM to 7PM.

Based on the above analysis, we can
include season, weekday, year, and hour, variables in our linear equation and
rebuild our model once again.

>
lm.fit<-lm(registered ~ temp + humidity + windspeed + weekday + season
+ year + hour, data=bike_train)

>
summary(lm.fit)

Call:

lm(formula =
registered ~ temp + humidity + windspeed + season +

weekday
+ year + hour, data = bike_train)

Residuals:

Min
1Q Median 3Q
Max

-354.92 -44.62
-1.58 42.24 447.05

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -56.02979
6.41291 -8.737 < 2e-16 ***

temp
3.41618 0.11599 29.453 < 2e-16
***

humidity
-0.73155 0.05218 -14.019 < 2e-16 ***

windspeed
-0.48243 0.11056 -4.363 1.29e-05 ***

season
18.70139 0.80233 23.309 < 2e-16 ***

weekday
2.08909 0.42015 4.972 6.72e-07 ***

year2012
75.41778 1.69425 44.514 < 2e-16 ***

hour1
-15.30919 5.82010 -2.630 0.00854 **

hour2
-24.07787 5.84085 -4.122 3.78e-05 ***

hour3
-32.86276 5.89405 -5.576 2.53e-08 ***

hour4
-33.95585 5.86720 -5.787 7.35e-09 ***

hour5
-18.60912 5.83458 -3.189 0.00143 **

hour6
36.67045 5.82669 6.294 3.22e-10 ***

hour7
164.73021 5.82224 28.293 < 2e-16 ***

hour8
299.88648 5.81858 51.539 < 2e-16 ***

hour9
143.97430 5.82201 24.729 < 2e-16 ***

hour10
75.81952 5.83540 12.993 < 2e-16 ***

hour11
91.18850 5.85794 15.567 < 2e-16 ***

hour12
123.54632 5.88130 21.007 < 2e-16 ***

hour13
115.47966 5.90946 19.541 < 2e-16 ***

hour14
96.08779 5.93029 16.203 < 2e-16 ***

hour15
106.88257 5.93492 18.009 < 2e-16 ***

hour16
170.89275 5.93159 28.811 < 2e-16 ***

hour17
325.76987 5.91085 55.114 < 2e-16 ***

hour18
305.13681 5.88780 51.825 < 2e-16 ***

hour19
206.67681 5.85621 35.292 < 2e-16 ***

hour20
135.93863 5.83632 23.292 < 2e-16 ***

hour21
92.50632 5.82264 15.887 < 2e-16 ***

hour22
61.70652 5.81716 10.608 < 2e-16 ***

hour23
27.37475 5.81367 4.709 2.52e-06 ***

---

Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 87.73 on 10856 degrees of freedom

Multiple

**R-squared: 0.6635,**Adjusted R-squared: 0.6626
F-statistic: 738.3 on 29 and 10856 DF, p-value: < 2.2e-16

Residual plots:

New model

**R-squared is now 0.6635**. This is much better than our first model (Registered users).
Similar equation can be modelled for
casual users as well.

Call:

lm(formula = casual ~
temp + humidity + windspeed + season +

weekday
+ year + hour, data = bike_train)

Residual standard
error: 36.57 on 10856 degrees of freedom

Multiple

**R-squared: 0.4657**, Adjusted R-squared: 0.4643
F-statistic: 326.3 on
29 and 10856 DF, p-value: < 2.2e-16