Tuesday, 9 August 2016

Big Data Analytics - Video

Big Data Analytics:  video.


Sunday, 8 May 2016

What is Sentiment Analysis?

What is Sentiment?
“Opinions” are key influencers of our behaviour. Our beliefs and perceptions of reality are conditioned by how others see the world. Whenever we need to make a decision, we often seek the opinions of others. In the past, individuals sought opinions from friends and family, while organizations used surveys, focus groups, opinion polls, etc. With more than 700 million people using online social media such as Facebook and Twitter to communicate with each other across the globe, companies see this as an opportunity to reach people and do business. Since businesses rely largely on word-of-mouth marketing, social media has become an e-WOM (electronic word of mouth) marketing tool. Customers have also become smarter, constantly checking product ratings, reviews, blogs and micro-blogs before making purchase decisions. All of these social media technologies are changing the customer experience and are increasingly being used to connect with customers, build strong relationships and convert regular customers into brand advocates. Online social media and e-WOM have rapidly given e-commerce a new face, called social media business, social commerce or social media marketing.

The internet and the web have changed the way people communicate. People can now post their feelings or opinions on the web freely. They can write product reviews and express their views about services in forums, on company websites, in e-mails, on blogs or on social sites like Facebook. Anyone who wants to shop for a new product can go to a forum or website, check the reviews of that product and decide whether to buy it or not. For a company, this is a new marketing challenge, but it also means the company may no longer need to conduct surveys or collect feedback to find out how satisfied customers are with its products and services, or how competitors are doing, because this information is now available to it instantly.

According to the Local Consumer Review Survey (2012), “approximately 72% of consumers surveyed said that they trust online reviews as much as personal recommendations, while 52% said that positive online reviews make them more likely to use a local business.”

According to Nielsen, a global leader in measurement and information, “Thirty-six percent of global online consumers report trust in online video ads and 40 percent say they believe ads viewed in search engine results. Sponsored ads on social networking sites are deemed credible only by 36 percent of global respondents. However, in India, the numbers are higher with 48 percent online consumers trusting online video ads and 52 percent believing ads viewed in search engine results. Sponsored ads on social networking sites fare better with 54 percent of respondents trusting this form of advertising.”

What is Sentiment Analysis?
Opinion mining, or sentiment analysis, is the process of determining the sentiment or opinion expressed on a given topic in a document. Political parties may be interested in knowing whether people support their program or not. Social organizations may want to find out people’s opinions on current debates. A cell phone company may be interested in knowing:
a. What users are saying when a product is launched
b. Which features are liked most
c. Which features they do not like
d. Whether they are talking positively or negatively

Finding opinions in web sources can be a formidable task because of the huge volume of data (text). It is difficult for a human reader to go through thousands of reviews to form an opinion. In many cases, opinions are hidden in long discussion posts or blogs. Thus, it is essential to have an opinion discovery and summarization system that can perform these formidable tasks automatically using Natural Language Processing and machine learning techniques.

In general, opinions can be expressed on products, services, individuals, organizations, an event or a subject. An opinion expressed consists of a target entity (an object) that has been commented on and its attributes (or properties). Each object can have a set of components and a set of attributes. Thus, an object can be hierarchically decomposed based on the relationship.

To understand this better, let us look at the following review of the Nikon D300s camera from bestbuy.com:

“I love this camera (1)!! No regrets whatsoever......highly recommended(2). I specially love the quality of the pictures(3)! I will be using this primarily for my photography business and I am truly satisfied with it (4). I also love the easy access to some of the most used features (mode, shutter, HD video and aperture) (5). I went from a Canon Rebel to a D300s(6)!! What an upgrade(7)! For ladies, I recommend Nikon D7000, it's lighter and similar to D300s, but with Half-plastic body(8). But I got a little cons for it, high ISO noise control, above 800 ISO, you can clearly distinguish the noise points on your shooting (9). Anti-dust system is effective but a little bit noisy(10). I continue to learn/discover something new about the features on this camera....Very happy with it(11)!! I would recommend this to a friend!(12)”

There are several opinions in this review. There are positive opinions (1, 2, 12) and negative opinions (9, 10). There are opinion comparisons (6, 7) and opinions regarding specific features (5, 8). However, the overall opinion of the review is positive (12). The first two sentences (1, 2) express an opinion on the camera as a whole. Then we notice that the opinions expressed in the next sentences have targets or objects on which the opinions are expressed. For example, in sentence 3 the opinion is on the quality of the pictures. In sentence 4, the opinion is on the use of the camera. Similarly, in sentence 5, the opinion is about specific features: mode, shutter, HD video and aperture. In sentences 6 to 8, the opinions compare the Nikon D300s with the Canon Rebel and the Nikon D7000. Sentences 9 and 10 express negative aspects of the camera: high-ISO noise and the noisy anti-dust system. In the last sentence, the opinion is a recommendation to a friend. With this example in mind, we now formally define the sentiment analysis or opinion mining problem.
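The sentence-by-sentence reading above can be roughly approximated in code. The following is only a minimal sketch, assuming a tiny hand-picked word list (the `POSITIVE` and `NEGATIVE` sets and the function name `sentence_polarity` are made up for this illustration); a real opinion mining system would rely on a much larger lexicon or a trained classifier.

```python
import re

# Toy word lists chosen only for this illustration (an assumption, not a
# standard lexicon).
POSITIVE = {"love", "recommend", "recommended", "happy", "satisfied", "effective"}
NEGATIVE = {"cons", "noise", "noisy"}


def sentence_polarity(sentence):
    """Label a sentence by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A few numbered sentences taken from the camera review above.
review = {
    1: "I love this camera",
    9: "But I got a little cons for it, high ISO noise control, above 800 ISO, "
       "you can clearly distinguish the noise points on your shooting",
    10: "Anti-dust system is effective but a little bit noisy",
    12: "I would recommend this to a friend!",
}

for number, sentence in review.items():
    print(number, sentence_polarity(sentence))
```

With this toy lexicon, sentences 1 and 12 come out positive and sentence 9 negative, but sentence 10 comes out neutral because “effective” and “noisy” cancel each other. Simple word counting cannot tell which attribute each opinion word refers to, which is one reason the problem needs a more careful definition.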
...................

Wednesday, 4 May 2016

Goodness of fit and the coefficient of determination r2 (R-square)


Simple Linear Regression
Linear regression is the most common prediction technique in use today. The term regression was introduced by Francis Galton. In his paper, Galton found that, “although there was a tendency for tall parents to have tall children and for short parents to have short children, the average height of children born of parents of a given height tended to move or “regress” towards the average height in population as a whole.” In other words, the height of the children of unusually tall or unusually short parents tends to move towards the average height of the population. His friend Karl Pearson collected more than a thousand records of the heights of members of family groups and confirmed Galton’s theory of universal regression.

The modern interpretation of regression is, however, quite different. Linear regression is concerned with the study of two variables, X and Y, where Y depends on X, the relationship between X and Y is linear (a straight line, i.e. the slope is constant) and the data are normally distributed.

In my discussion, I will be using the term “variable”, a quantity that varies (if it didn't vary, it would be a constant). There are two types of variables: the dependent variable, also called the response variable, which depends on other variables, and those other variables, called explanatory or independent variables. The two do not vary independently (in a statistical sense); rather, they tend to vary together. Depending on the context, an independent variable is also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable." A dependent variable is also known as a "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable."

In a linear model, we try to “fit” the data to a straight-line (linear) function. In other words, variable Y varies as a straight-line function of another variable X. Data points do not always follow a linear model. A measure of the absolute amount of “variability” in a variable is its variance, which is defined as its average squared deviation from the mean. A linear regression line has an equation of the form:
Y = b0 + b1x


Where x is the explanatory variable and Y is the dependent variable. Written in full, the left-hand side is E(Y | x): E, which is read as “expected value of”, indicates a population mean, and Y | x, which is read “Y given x”, indicates that we are looking at the possible values of Y when x is restricted to some single value. The slope of the line is b1, and b0 is the intercept (the value of Y when x = 0).

However, the coefficients b1 and b0 are calculated from the available data, whereas in real life “prediction” often means forecasting values outside the range of the given X values, using the equation and the coefficient values calculated earlier. The X value is “extrapolated”, assuming that the relationship remains linear. The model essentially rests on the assumption of “linearity”, at least within the range of the observed explanatory data.

There are two ways in which you can compute the values of the parameters to fit a function: 1) the Ordinary Least Squares (OLS) method and 2) the Maximum Likelihood (ML) method. Nowadays, the least squares method (OLS) is widely used to find the numerical values of the parameters that fit a function to a set of data. It is one of the oldest methods in statistics and was first published by the French mathematician Legendre in 1805. After the publication of Legendre’s memoir, the famous German mathematician Carl Friedrich Gauss published another memoir in which he mentioned that he had used this method as early as 1795.

Since many statistics books explain the calculation of the parameters b0 and b1, I will not attempt to explain it here. You can refer to any statistics book for the detailed calculation of the parameters (estimates).
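For quick reference, the standard textbook closed-form least squares estimates can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line` and the five data points are made up for the example and do not come from any data set discussed here.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line Y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1


# Made-up sample data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
print(round(b0, 2), round(b1, 2))
```

For this sample the slope comes out close to 2 and the intercept close to 0, as expected from how the data were chosen.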




Sometimes the symbols β̂1 and β̂0 are used to represent b1 and b0. Even though these have Greek letters in them, the “^” (the “hat”) over them tells us that we are dealing with statistics, not just parameters.

A linear regression model is going to attempt, using the least squares formulas, to fit a straight line to this set of data. But it is clear that the association between x and y is not linear.





The Ordinary Least Squares (OLS) method is based on the following assumptions:
  1. The regression model is linear in the parameters
  2. In the given data, the independent variables (X) are independent of one another
  3. At every value of X, the observed points should follow a roughly normal distribution centered at the fitted value of Y
  4. Homoscedasticity or equal variance: the variance of Y around the fitted value is the same for all values of X


Hence, in the above method, the objective is not only to find the values of b1 and b0 but also to know how close these values are to their counterparts in the true real-world population. In other words, how good is our “prediction” model? Therefore, we need some measure of the “reliability” or “precision” of the estimators b1 and b0.
In statistics, the precision of an estimate is measured by what is called the “standard error” (se).
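As a rough illustration, the usual textbook formulas for the standard errors of b0 and b1 in simple linear regression can be written out as follows. This is a sketch under the OLS assumptions listed above; the function name `ols_with_standard_errors` and the sample data are made up for the example.

```python
import math


def ols_with_standard_errors(xs, ys):
    """Return (b0, b1, se_b0, se_b1) for the line Y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    # Residual sum of squares and the estimated error variance.
    # n - 2 degrees of freedom because two parameters were estimated.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1 / n + x_mean ** 2 / sxx))
    return b0, b1, se_b0, se_b1


# Made-up sample data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, se_b0, se_b1 = ols_with_standard_errors(xs, ys)
print(round(se_b0, 3), round(se_b1, 3))
```

The smaller the standard error relative to the estimate, the more precisely the coefficient has been estimated from the sample.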



In my next section, I will explain “goodness of fit” and the coefficient of determination r2.



In the above section, I discussed how to calculate the regression coefficients and their standard errors. We now consider how well the points fit the line, also called “goodness of fit”. By drawing a scatter plot of the X and Y variables, you can analyze the relationship between two paired variables. It is clear from Figure 1 that if all the observations were to lie on the regression line, we would obtain a perfect fit. In a perfect fit, there would be no difference between the observed and fitted values, with the points plotting right on the line.

Figure 1

However, this is not always true in the real world. Some points (u1,u4) may be above the line (positive) and some points (u3,u4) may be below the line (negative). We hope that this error, called the residual error, will be as small as possible.


The coefficient of determination, r2, indicates the extent to which the dependent variable is predictable. Let me explain this with the help of a Venn diagram, as shown in Figure 2. In this figure, circle X represents the variation of X and circle Y represents the variation of Y.

The overlap of the two circles (shaded area) indicates the extent to which the variation in Y is explained by the variation in X. The greater the extent of the overlap, the greater the variation in Y explained by X. The r2 is simply the measure of overlap. When there is no overlap, r2 is zero and when the overlap is complete, r2 is 1, since 100 percent of variation in Y is explained by X.
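The “share of variation explained” idea can also be computed directly. The sketch below uses the standard definition r2 = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of Y from its mean; the function name `r_squared` and the sample data are made up for the illustration.

```python
def r_squared(xs, ys):
    """Fit Y = b0 + b1*x by least squares and return r2 = 1 - SSE/SST."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    # Variation in Y left unexplained by the line (residuals).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation in Y around its mean.
    sst = sum((y - y_mean) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up sample data: nearly linear, so r2 should be close to 1.
noisy = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
# Points lying exactly on a line: complete overlap, r2 equal to 1.
perfect = r_squared([0, 1, 2, 3], [1, 3, 5, 7])
print(round(noisy, 4), perfect)
```

In terms of the Venn diagram, SSE/SST is the part of circle Y that lies outside the overlap, and r2 is the part inside it.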

References:
  1. Galton, F., “Family Likeness in Stature”, Proceedings of the Royal Society, London, Vol. 40, 1886, pp. 42-72.
  2. Pearson, K., and Lee, A., “On the Laws of Inheritance”, Biometrika, Vol. 2, Nov. 1903, pp. 357-462.
  3. Dodge, Y., “The Oxford Dictionary of Statistical Terms”, 2003, OUP. ISBN 0-19-920613-9.
  4. Gujarati, D., “Basic Econometrics”, 2004, Tata McGraw-Hill.
  5. Plackett, R.L., “The discovery of the method of least squares”, Biometrika, Vol. 59, 1972, pp. 239–251.
  6. Seal, H.L., “The historical development of the Gauss linear model”, Biometrika, Vol. 54, 1967, pp. 1–23.
  7. Kennedy, P., “Ballentine: A Graphical Aid for Econometrics”, Australian Economic Papers, Vol. 20, 1981, pp. 414-416.




Wednesday, 13 April 2016

Internet of Things (IoT) is catching up!!

According to a recent research report by Gartner, Inc., there will be nearly 26 billion devices on the Internet by 2020. Gartner also forecasts that 6.4 billion connected things will be in use worldwide in 2016, and that the number will reach 20.8 billion by 2020[1]. They also estimate that the Internet of Things (IoT) will support total services spending of $235 billion in 2016. According to ABI Research, the revenues from integrating, storing, analyzing, and presenting IoT data will reach USD 5.7 billion in 2015. From FitBits to Apple Watches, wearable tech and IoT will explode into the marketplace soon[2].

Yet another estimate by Business Insider, 2016, reports that there will be 34 billion devices connected to the internet by 2020. IoT devices will account for 24 billion, while traditional computing devices (e.g. smartphones, tablets, smartwatches, etc.) will comprise 10 billion. Nearly $6 trillion will be spent on IoT solutions over the next five years[3].

The Internet is a network of networks, and a network comprises connected devices; hence the Internet of Things is an Internet of devices connected through tiny embedded sensors and computing power. Internet of Things, Internet of Everything (by Cisco) and Smart Things (by IBM) are terms coined by different companies for the same idea: many connected devices. Because of the popularity of, and advances in, semiconductor, web, wireless, mobile and security technologies, anyTHING and everyTHING can be connected to the Internet. The THING on the INTERNET can be your refrigerator, your home security system, the surveillance camera at your child’s play home or your remote office, RFID systems or any everyday THING. Hence the name: the Internet of Things (IoT) is nothing but the phenomenon of every conceivable device getting connected to the Internet.

The idea of IoT is that not only can your computer and your smartphone talk to each other over the Internet, but so can all the things around you. This ranges from connected homes, refrigerators, cars, trains and roads to devices that can track an individual’s movement and even your pulse rate, heartbeat, the calories you burn, your location, etc. These data can be “pushed” to numerous BIG DATA applications to solve everyday problems, potentially leading to a new and improved customer experience.

The THINGS on the Internet can be controlled, observed, or analysed by other smart devices on the Internet, if programmed properly. An IoT environment comprises smart devices that are Always connected to each other, Anywhere and at Anytime (the 3 As). These devices can be configured to constantly send their data to a cloud server for further analysis that can help in making decisions and taking business actions. The real value is in the analysis of the data, and particularly in how this analysis can lead to predicting the future.

What makes a THING to be on the Internet?

1. Should be an IP device (IPv6 or IPv4).

2. Should have a unique identification to start communicating.

3. Should be able to send configuration, events, and sensory data over the Internet.

4. Should be able to present its identity to advanced front end mobile applications or web applications so that the applications can extract data from these devices. The data analysis will then help in making automatic decisions of controlling and configuring these devices.
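To make requirements 2 and 3 a little more concrete, here is a minimal sketch of a device packaging configuration, event, or sensory data together with its unique identifier. Everything in it is an assumption made for illustration: the JSON field names, the `build_message` helper and the example reading are hypothetical and do not follow any particular IoT protocol or vendor API. Actually sending the message over the network is left out.

```python
import json
import time
import uuid

# The three kinds of data mentioned in the list above.
ALLOWED_KINDS = {"configuration", "event", "sensor"}


def build_message(device_id, kind, body, timestamp=None):
    """Serialize one device message as JSON (hypothetical message layout)."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown message kind: {kind}")
    message = {
        "device_id": device_id,  # unique identification of the THING
        "kind": kind,
        "timestamp": time.time() if timestamp is None else timestamp,
        "body": body,
    }
    return json.dumps(message)


# A randomly generated identifier stands in for a real device identity.
device_id = str(uuid.uuid4())
payload = build_message(device_id, "sensor", {"temperature_c": 21.5})
print(payload)
```

A front-end or cloud application receiving such messages could decode them with `json.loads` and use the `device_id` field to tell devices apart before analysing the data.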

IoT can play a major role in the utilities, oil & gas, manufacturing, transportation, retail and other business sectors, as the THINGS that these businesses depend on can send enough data to a remote server for arriving at informed decisions. The data sent by IoT devices can be used to improve asset utilization, to track devices better, and to provide real-time insights.

Imagine you have a THING inside your car that can sense the temperature, knows your schedule and starts the car engine automatically 10 minutes before you leave the office in a Chicago winter. How about controlling your home thermostat and security system while you are away from home? Pretty soon, all home appliances will be Wi-Fi and IoT enabled and will let you make informed decisions about when to adjust your thermostat, when the food in the fridge has to be thrown out, when you have to pick up milk, or when your FedEx package will reach your home so that you can sign for and collect the parcel.

Though IoT has many benefits, it also has several challenges. The main challenges include signaling, security, power consumption and bandwidth.

References:
1.  “Gartner Says 6.4 Billion Connected "Things" Will Be in Use in 2016, Up 30 Percent From 2015”, Stamford, Conn., November 10, 2015. Last visited March 15, 2016. http://www.gartner.com/newsroom/id/3165317

2.  “Market for IoT Analytics to Reach US$5.7 Billion in 2015, with Startups Driving the Innovation”, London, United Kingdom, January 13, 2015. Last visited March 16, 2016. https://www.abiresearch.com/press/market-for-iot-analytics-to-reach-us57-billion-in-/

3.  John Greenough and Jonathan Camhi, “Here are IoT trends that will change the way businesses, governments, and consumers interact with the world”, March 10, 2016. http://www.businessinsider.com/top-internet-of-things-trends-2016-1?IR=T