Presented by: Derek Kane
 Introduction to Regression Analysis
 Ordinary Least Squares
 Assumptions
 Detecting Violations
 Interaction Terms
 Log-Level & Log-Log Transformations
 ANOVA
 Practical Example
 Real Estate
 Supermarket Marketing
 Regression Analysis is the art and science of fitting
straight lines to patterns of data.
 Regression analysis is widely used for prediction
and forecasting, where its use has substantial
overlap with the field of machine learning.
 In a linear regression model, the variable of
interest (the dependent variable) is predicted from
one or more predictor variables (the independent
variables) using a linear mathematical formula.
 Regression analysis is also used to understand
which among the independent variables are
related to the dependent variable, and to explore
the forms of these relationships.
History:
 The earliest form of regression was the method of
least squares, which was published by the French
mathematician Adrien-Marie Legendre in 1805
and by the German mathematician Carl Friedrich Gauss in 1809.
 Legendre and Gauss both applied the method to
the problem of determining, from astronomical
observations, the orbits of bodies about the Sun
(mostly comets, but also later the then newly
discovered minor planets).
 In the 1950s and 1960s, economists used
electromechanical desk calculators to calculate
regressions. Before 1970, it sometimes took up to
24 hours to receive the result from one
regression.
The first published picture of a regression
line by Francis Galton in 1877
 Many techniques for carrying out regression
analysis have been developed.
 Familiar methods such as linear regression
and ordinary least squares regression are
parametric, in that the regression function is
defined in terms of a finite number of
unknown parameters that are estimated
from the data.
 Non-parametric regression refers to
techniques that allow the regression function
to lie in a specified set of functions, which
may be infinite-dimensional.
 Our focus will be on ordinary least squares
regression and parametric methods.
Regression analysis may be used for a wide variety of business applications, such as:
 Measuring the impact of a price increase on a corporation's profits.
 Understanding how sensitive a corporation's sales are to changes in advertising
expenditures.
 Seeing how a stock price is affected by changes in interest rates.
 Calculating price elasticity for goods and services.
 Litigation and information discovery.
 Total Quality Control Analyses.
 Human Resource and talent evaluation.
 Regression analysis may also be used for forecasting purposes; for example, a regression
equation may be used to forecast the future demand for a company's products.
Simple Linear Regression Formula
 The simple regression model can be represented as follows:
Y = β₀ + β₁X₁ + ε
 The β₀ represents the Y intercept value, the coefficient β₁ represents the slope of the line,
X₁ is an independent variable, and ε is the error term. The error term is the value
needed to correct for a prediction error between the observed and predicted value.
[Diagram: Y is the dependent variable, β₀ the intercept, β₁ the coefficient, X₁ the independent variable, and ε the error term.]
Simple Linear Regression Formula
 The output of a regression analysis will produce a coefficient table similar to the one
below.
 This table shows that the intercept is -114.326 and the Height coefficient is 106.505 +/-
11.55.
 This can be interpreted as follows: for each one unit increase in X, we can expect Y to increase
by 106.5.
 Also, the T value and Pr > |t| indicate that these variables are statistically significant at the
0.05 level and can be included in the model.
Multiple Linear Regression Formula
 A multiple linear regression is essentially the same as a simple linear regression except
that there can be multiple coefficients and independent variables.
Y = β₀ + β₁X₁ + β₂X₂ + … + ε
 The interpretation of the coefficients is slightly different than in a simple linear regression.
Using the table below, the interpretation can be thought of as follows:
 Each 1 unit increase in Width increases Y by 94.56, holding all other
coefficients constant.
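To make this concrete, here is a minimal R sketch of fitting and reading a multiple linear regression; the data frame df and the column names Width and Length are hypothetical stand-ins for the data behind the table.

# Fit a multiple linear regression of Y on two predictors.
model <- lm(Y ~ Width + Length, data = df)

# The summary reports each coefficient's estimate, standard error,
# t value, and Pr(>|t|), as in the tables shown on these slides.
summary(model)

# Each slope is the change in Y per one unit change in that predictor,
# holding the other predictors constant.
coef(model)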
What is Ordinary Least Squares or OLS?
 In statistics, ordinary least squares
(OLS) or linear least squares is a
method for estimating the unknown
parameters in a linear regression
model.
 The goal of OLS is to minimize the sum of
the squared differences between the observed
responses in the dataset and the responses
predicted by the linear approximation of the
data.
 Visually, this corresponds to the sum of the
squared vertical distances between each data
point in the set and the
corresponding point on the
regression line.
 The smaller the differences (square
size), the better the model fits the
data.
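To make the estimation concrete, here is a minimal R sketch, on made-up data, of the OLS solution to the normal equations, beta-hat = (XᵀX)⁻¹Xᵀy; lm() is included to show it produces the same estimates.

# Small made-up dataset.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Design matrix with an intercept column.
X <- cbind(1, x)

# OLS estimates from the normal equations: (X'X)^(-1) X'y.
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# lm() minimizes the same sum of squared residuals.
coef(lm(y ~ x))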
There are a number of classical assumptions which must hold true if we are to trust the
outcome of the regression model.
 The sample is representative of the population for the inference prediction.
 The error is a random variable with a mean of zero conditional on the
independent variables.
 The independent variables are measured with no error.
 The predictors are linearly independent, i.e. it is not possible to express any
predictor as a linear combination of the others.
 The errors are uncorrelated, that is, the variance–covariance matrix of the errors is
diagonal and each non-zero element is the variance of the error.
 The variance of the error is constant across observations (homoscedasticity).
Consequences of using an invalid modeling
procedure include:
 The consequences have a tremendous impact
on the theory that formed the basis of
investigating this aspect of human nature.
 A lack of linear association between
independent and dependent variables, model
misspecification, etc. suggests that you have
the wrong theory.
 Biased, inefficient coefficients due to poor
reliability, collinearity, etc. lead to an incorrect
interpretation of your theory.
 Outliers imply that you are not able to apply
your theory to the entire population from which
you drew your data.
 Overfitting implies that you are overconfident
in your theory.
There are a number of statistics and diagnostic tests
we can draw from to evaluate linear regression
models beyond EDA.
 Coefficient of Determination
 Residual Plot
 Breusch-Pagan or White Test
 Variance Inflation Factor
 Influential Observations
 Leverage Points
 Cook’s Distance
 Etc…
R2 : Coefficient of Determination
 This is a measure of the goodness of fit for a
linear regression model.
 It is the percentage of the variation in the
dependent variable that is explained by a linear model.
 R2 = Explained variation / Total variation
 R2 is always between 0 and 100%:
 0% indicates that the model explains none of
the variability of the dependent data around
its mean.
 100% indicates that the model explains all the
variability of the dependent data around its
mean.
Are Low R2 Values Inherently Bad?
 No!
There are two major reasons why it can be just fine to have low R-squared values.
 In some fields, it is entirely expected that your R-squared values will be low. For
example, any field that attempts to predict human behavior, such as psychology,
typically has R-squared values lower than 50%. Humans are simply harder to predict
than, say, physical processes.
 Furthermore, if your R-squared value is low but you have statistically significant
predictors, you can still draw important conclusions about how changes in the
predictor values are associated with changes in the response value.
 Regardless of the R-squared, the significant coefficients still represent the mean
change in the response for one unit of change in the predictor while holding other
predictors in the model constant. Obviously, this type of information can be
extremely valuable.
 Adding independent variables to your model will increase the value of R-squared
regardless of whether the variables offer an increase in explanatory power. To combat this
issue, we should focus on the Adjusted R-squared metric, which penalizes a
model for having too many variables.
 There is no generally accepted technique for relating the number of total observations to
the number of independent variables in a model. One possible rule of thumb, suggested
by Good and Hardin, is N = mⁿ.
 Here N is the sample size, n is the number of independent variables, and m is the
number of observations needed to reach the desired precision if the model had only one
variable.
 For example, if the dataset contains 1,000 observations and the researcher decides that
5 observations are needed to support a single variable, then the maximum number of
independent variables the model can support is 4 (since 5⁴ = 625 ≤ 1,000 < 5⁵).
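A quick arithmetic check of this example in R:

N <- 1000   # total observations in the dataset
m <- 5      # observations needed to support a single variable
# Largest n such that m^n <= N:
floor(log(N) / log(m))   # returns 4, so at most 4 independent variables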
Key Limitations of R2
 R-squared cannot determine whether the
coefficient estimates and predictions are
biased, which is why you must assess the
residual plots.
 R-squared does not indicate whether a
regression model is adequate. You can
have a low R-squared value for a good
model, or a high R-squared value for a
model that does not fit the data!
 The R-squared in your output is a biased
estimate of the population R-squared.
 A residual plot is a scatterplot of the
residuals (difference between the actual
and predicted value) against the
predicted value.
 A proper model will exhibit a random
pattern for the spread of the residuals
with no discernable shape.
 Residual plots are used extensively in
linear regression analysis for diagnostics
and assumption testing.
 If the residuals form a curved shape, then we
know that a transformation will be necessary,
and we can explore methods like the Box-Cox
transformation.
[Figures: random residuals vs. curved residuals]
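A minimal R sketch of this diagnostic, assuming a fitted model object named model:

# Residuals vs. predicted values; a good fit shows a random cloud
# around zero with no curvature or funnel shape.
plot(fitted(model), resid(model),
     xlab = "Predicted value", ylab = "Residual")
abline(h = 0, lty = 2)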
 Linear Regression Analysis using OLS contains
an assumption that residuals are identically
distributed across every X variable.
 When this condition holds, the error terms are
homoskedastic, which means the errors have
the same scatter regardless of the value of X.
 When the scatter of the errors is different,
varying depending on the value of one or more
of the independent variables, the error terms
are heteroskedastic.
 A review of a scatterplot of the studentized
residuals against the dependent variable can be
used to detect if heteroskedasticity is present.
The residuals will appear to fan outward in a
distinct pattern.
Heteroskedasticity Pattern
 Heteroskedasticity has serious consequences for the OLS estimator. Although the OLS
estimator remains unbiased, the estimated SE is wrong. Because of this, confidence
intervals and hypotheses tests cannot be relied on.
 The Breusch-Pagan test (alt. White Test) is a method that can be employed to identify
whether or not the error variances are all equal versus the alternative that the error
variances are a multiplicative function of one or more variables.
 The results of this test show that the Chi-Square value is fairly low indicating that
heteroskedasticity is probably not a problem.
Techniques to correct heteroskedasticity:
 Re-specify the model. (Include omitted variables)
 Transform the variables.
 Use Weighted Least Squares in place of OLS.
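A minimal sketch of running the Breusch-Pagan test in R, assuming the lmtest package and a fitted model object named model:

library(lmtest)

# Breusch-Pagan test: the null hypothesis is constant error variance.
# A small p-value is evidence of heteroskedasticity.
bptest(model)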
What is Multicollinearity?
 Collinearity (or multicollinearity) is the undesirable
situation where the correlations among the
independent variables are strong.
 In some cases, multiple regression results may
seem paradoxical. For instance, the model may fit
the data well (high F-Test), even though none of
the X variables has a statistically significant impact
on explaining Y.
 How is this possible? When two X variables are
highly correlated, they both convey essentially the
same information. When this happens, the X
variables are collinear and the results show
multicollinearity.
Why is Multicollinearity a Problem?
 Multicollinearity misleadingly inflates the
standard errors of the coefficients.
 Thus, it makes some variables
statistically insignificant when they
should otherwise be significant.
 It is like two or more people singing
loudly at the same time. One cannot
discern which is which. They offset each
other.
How to detect Multicollinearity?
 Formally, variance inflation factors (VIF) measure
how much the variance of each estimated
coefficient is increased over the case of no
correlation among the X variables. If no two X
variables are correlated, then all the VIFs will be 1.
 If VIF for one of the variables is around or greater
than 5, there is collinearity associated with that
variable.
 The easy solution: if two or more
variables have a VIF around or greater
than 5, one of these variables must be removed
from the regression model. To determine the best
one to remove, try removing each one individually
and select the regression equation that explains the
most variance (highest R²).
VIF > 5: collinearity present.
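A minimal sketch of checking VIFs in R, assuming the car package and a fitted model object named model:

library(car)

# Variance inflation factors; values around 5 or above flag
# collinearity associated with that variable.
vif(model)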
 Cook’s distance or Cook’s D is a commonly
used estimate of the influence of a data
point when performing OLS regression.
 Studentized residuals are the quotients
resulting from dividing each residual by an
estimate of its standard deviation.
 The Bonferroni method is a simple method
that allows many comparison statements
to be made (or confidence intervals to be
constructed) while still assuring that an overall
confidence coefficient is maintained.
 The hat values (leverages) measure how far an
observation's independent variable values are
from those of the other observations, and thus
its potential influence on the fitted values.
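A minimal R sketch of these diagnostics, assuming a fitted model object named model (outlierTest() from the car package applies the Bonferroni adjustment):

# Cook's distance: influence of each observation on the fitted model.
cooks.distance(model)

# Studentized residuals: each residual divided by an estimate of its
# standard deviation.
rstudent(model)

# Hat values: the leverage of each observation.
hatvalues(model)

# Bonferroni-adjusted test of the largest studentized residual.
library(car)
outlierTest(model)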
 Adding interaction terms to a regression model can
greatly expand understanding of the relationships
among the variables in the model and allows more
hypotheses to be tested.
Height = 42 + 2.3 * Bacteria + 11 * Sun
 Height = The height of a shrub. (cm)
 Bacteria = the amount of bacteria in the soil
(1000 per ml).
 Sun = whether the shrub is located in partial or full
sun (Sun = 0 is partial and Sun = 1 is full).
 It would be useful to add an interaction term to the
model if we wanted to test the hypothesis that the
relationship between the amount of bacteria in the
soil on the height of the shrub was different in full
sun than in partial sun.
 One possibility is that in full sun, plants with more
bacteria in the soil tend to be taller, whereas in
partial sun, plants with more bacteria in the soil are
shorter.
 Another possibility is that plants with more bacteria
in the soil tend to be taller in both full and partial
sun, but that the relationship is much more
dramatic in full than in partial sun.
 The presence of a significant interaction
indicates that the effect of one predictor
variable on the response variable is different at
different values of the other predictor variable.
 It is tested by adding a term to the model in
which the two predictor variables are
multiplied.
The regression equation will look like this:
Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
 Adding an interaction term to a model
drastically changes the interpretation of all of
the coefficients.
 If there were no interaction term, B1 would
be interpreted as the unique effect of
Bacteria on Height. But the interaction
means that the effect of Bacteria on Height
is different for different values of Sun.
 So the unique effect of Bacteria on Height is
not limited to B1, but also depends on the
values of B3 and Sun.
 The unique effect of Bacteria is represented
by everything that is multiplied by Bacteria in
the model: B1 + B3*Sun. B1 is now
interpreted as the unique effect of Bacteria
on Height only when Sun = 0.
Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
 In our example, once we add the interaction
term, our model looks like:
 Height = 35 + 4.2*Bacteria + 9*Sun +
3.2*Bacteria*Sun
 Adding the interaction term changed the
values of B1 and B2.
 The effect of Bacteria on Height is now 4.2 +
3.2*Sun. For plants in partial sun, Sun = 0, so
the effect of Bacteria is 4.2 + 3.2 * 0 = 4.2.
 So for two plants in partial sun, a plant with
1000 more bacteria/ml in the soil would be
expected to be 4.2 cm taller than a plant
with less bacteria.
 For plants in full sun, however, the effect
of Bacteria is 4.2 + 3.2*1 = 7.4.
 So for two plants in full sun, a plant with
1000 more bacteria/ml in the soil would
be expected to be 7.4 cm taller than a
plant with less bacteria.
 Because of the interaction, the effect of
having more bacteria in the soil is
different if a plant is in full or partial sun.
 Another way of saying this is that the
slopes of the regression lines between
height and bacteria count are different for
the different categories of sun. B3
indicates how different those slopes are.
Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
 Interpreting B2 is more difficult.
 B2 is the effect of Sun when Bacteria = 0. Since
Bacteria is a continuous variable, it is unlikely that
it equals 0 often, if ever, so B2 can be virtually
meaningless by itself.
 Instead, it is more useful to understand the effect
of Sun, but again, this can be difficult.
 The effect of Sun is B2 + B3*Bacteria, which is
different at every one of the infinite values of
Bacteria.
 For that reason, often the only way to get an
intuitive understanding of the effect of Sun is to
plug a few values of Bacteria into the equation to
see how Height, the response variable, changes.
Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
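To make this concrete, here is a small R sketch that plugs a few Bacteria values into the fitted equation for both Sun settings:

# The fitted equation from the example.
height <- function(bacteria, sun) {
  35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun
}

bacteria <- c(0, 5, 10)
# Effect of Sun at each Bacteria level: full sun minus partial sun.
height(bacteria, sun = 1) - height(bacteria, sun = 0)
# Returns 9, 25, 41: the Sun effect is B2 + B3 * Bacteria and grows
# with the amount of bacteria in the soil.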
 The presentation so far has only considered the following form of a linear regression
equation:
y = β₀ + β₁x₁ + ε
 This is also considered a "level-level" specification because the raw values of y are being
regressed against the raw values of x.
How do we interpret β₁?
 We differentiate with respect to x₁ to find the marginal effect of x on y: dy/dx₁ = β₁. In this
case, β₁ itself is the marginal effect.
 A log-level regression specification:
ln(y) = β₀ + β₁x₁ + ε
 This is called a "log-level" specification because the natural-log-transformed values of y
are being regressed on the raw values of x.
 You might want to run this specification if you think that unit increases in x lead to a
constant percentage increase in y.
 Ex. Wage on Education? Forest Lumber Volume on Years?
How do we interpret β₁?
 First solve for y: y = e^(β₀ + β₁x₁ + ε)
 Then differentiate to get the marginal effect: dy/dx₁ = β₁ · y
 So the marginal effect depends on the value of y, with β₁ itself representing the growth
rate.
 For example, if we estimate that β₁ is 0.04, we would say that each additional year
increases the volume of lumber by approximately 4%.
 A log-log regression specification:
ln(y) = β₀ + β₁ ln(x₁) + ε
 This is called a "log-log" specification because the natural-log-transformed values of y are
being regressed on the log-transformed values of x.
 You might want to run this specification if you think that percentage increases in x lead to
constant percentage changes in y. Ex. Constant demand elasticity.
 To calculate marginal effects, solve for y: y = e^(β₀ + β₁ ln(x₁) + ε)
 And differentiate with respect to x₁: dy/dx₁ = β₁ · (y/x₁)
 Solving for β₁ we get: β₁ = (dy/y) / (dx₁/x₁)
 This makes β₁ an elasticity. If x₁ is a price and y is demand and we estimate β₁ = -0.6, it
means that a 1% increase in the price of the good would lead to a 0.6% decrease in
demand.
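A minimal R sketch of estimating a constant-elasticity (log-log) model; the data frame df and the quantity and price columns are hypothetical:

# Log-log specification: the slope on log(price) is the price elasticity.
model <- lm(log(quantity) ~ log(price), data = df)
coef(model)["log(price)"]   # e.g. -0.6: a 1% price increase lowers
                            # demand by about 0.6%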
 Analysis of the variance or ANOVA is used to
compare differences of means among more
than 2 groups.
 It does this by looking at variation in the data
and where that variation is found (hence its
name).
 Specifically, ANOVA compares the amount of
variation between groups with the amount of
variation within groups.
 It can be used for both observational and
experimental studies.
 When we take samples from a population, we expect each sample mean to differ
simply because we are taking a sample rather than measuring the whole
population; this is called sampling error but is often referred to more informally as
the effects of “chance”.
 Thus, we always expect there to be some differences in means among different
groups.
 The question is: is the difference among groups greater than that expected to be
caused by chance? In other words, is there likely to be a true (real) difference
among the population means?
The ANOVA model
 Mathematically, ANOVA can be written as:
xᵢⱼ = μᵢ + εᵢⱼ
 where x are the individual data points (i and j
denote the group and the individual
observation), ε is the unexplained variation, and
the parameters of the model (μ) are the
population means of each group. Thus, each
data point (xᵢⱼ) is its group mean plus error.
Assumptions of ANOVA
 The response is normally distributed
 Variance is similar within different groups
 The data points are independent
Hypothesis testing
 Like other classical statistical tests, we use ANOVA to calculate a test statistic (the F-ratio)
with which we can obtain the probability (the P-value) of obtaining the data assuming the
null hypothesis.
 Null hypothesis: all population means are equal
 Alternative hypothesis: at least one population mean is different from the rest.
 A significant P-value (usually taken as P < 0.05) suggests that at least one group mean is
significantly different from the others. In other words, a variable with P < 0.05 is a
candidate for inclusion in a predictive model.
 ANOVA separates the variation in the dataset into 2 parts: between-group and within-
group. These variations are called the sums of squares, which can be seen in the following
slides.
Calculation of the F ratio
Step 1) Variation between groups
 The between-group variation (or between-group sums of squares, SS) is calculated by
comparing the mean of each group with the overall mean of the data.
 Specifically, this is:
Between SS = n₁(x̄₁ − x̄)² + n₂(x̄₂ − x̄)² + n₃(x̄₃ − x̄)²
 We then divide the Between SS by its degrees of freedom [this is like the number of
groups, except it is the number of groups minus 1, because the deviations must sum to
zero, so once you know all but one, the last is also known] to get our estimate of the
mean variation between groups.
Step 2) Variation within groups
 The within-group variation (or the within-group sums of squares) is the variation of each
observation from its group mean.
Within SS = s₁²(n₁ − 1) + s₂²(n₂ − 1) + s₃²(n₃ − 1)
 i.e., by adding up the variance of each group multiplied by its degrees of freedom.
Note, you might also come across the total SS (the sum of squared deviations of every
observation from the overall mean). Within SS is then Total SS minus Between SS.
 As before, we then divide by the total degrees of freedom to get the mean variation
within groups.
Step 3) The F ratio
 The F ratio is then calculated as:
F Ratio = Mean Between-Group SS / Mean Within-Group SS
 If the average difference between groups is similar to that within groups, the F ratio is
about 1. As the average difference between groups becomes greater than that within
groups, the F ratio becomes larger than 1.
 Therefore, variables with higher F Ratio values provide greater explanatory power when
utilized in predictive models.
 To obtain a P-value, the F ratio is tested against the F-distribution with the degrees of
freedom associated with the numerator and denominator of the ratio. The
P-value is the probability of getting that F ratio or a greater one. Larger F-ratios give
smaller P-values.
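A minimal sketch of a one-way ANOVA in R, assuming a data frame df with a numeric response column and a grouping factor:

# One-way ANOVA: compares between-group to within-group variation.
fit <- aov(response ~ group, data = df)
summary(fit)   # reports the F ratio and its P-value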
 Do you prefer ketchup or soy sauce?
 If someone asked you this question, your answer would likely depend upon what you
were eating. You probably wouldn't dunk your spicy tuna roll in ketchup. And most
people (pregnant moms-to-be excluded) don't seem to fancy eating soy sauce with hot
French fries.
A Common Error When Using ANOVA to Assess
Variables
 So you collect data about your variables of interest,
and now you're ready to do your analysis. This is
where many people make the unfortunate mistake of
looking only at each variable individually.
 In addition to considering how each variable impacts
your response variable, you also need to evaluate the
interaction between those variables and determine if
any of those are significant as well.
 And much like your preference for ketchup versus soy
sauce depends upon what you’re eating, optimum
settings for a given variable will depend upon the
settings of another variable when an interaction is
present.
How to Evaluate and Interpret an Interaction
 Let’s use a weight loss example to illustrate how we can evaluate an interaction between
factors. We're evaluating 2 different diets and 2 different exercise programs: one focused
on cardio and one focused on weight training. We want to determine which results in
greater weight loss. We randomly assign participants to either diet A or B and either the
cardio or weight training regimen, and then record the amount of weight they've lost
after 1 month.
 Here is a snapshot of the data:
 Example: We want to understand how to explain the WeightLoss variable using the
Diet variable.
Observations:
 The F Value is well over 1 indicating that this variable
has some explanatory value for WeightLoss.
 The P-Value is statistically significant at the 0.05 level.
 Let’s look at the ANOVA output for both the Diet and Exercise variables.
Observations:
 The Diet variable has an F Value of 13.69.
 The Exercise variable has an F Value of 6.355.
 Both variables are statistically significant at the 0.05 level.
 We can see that the p-value for the Exercise*Diet interaction is 0.000. Because this p-
value is so small, we can conclude that there is indeed a significant interaction between
Exercise and Diet.
 So which diet is better? Our data suggest it’s like asking “ketchup or soy sauce?” The
answer is, "It depends."
 Since the Exercise*Diet interaction is significant, let’s use an interaction plot to take a
closer look:
 For participants using the cardio program (shown in black), we can see that diet A is best
and results in greater weight loss. However, if you're following the weight training
regimen (shown in red), then diet B results in greater weight loss than A.
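A minimal R sketch of this plot and of testing the interaction, assuming a data frame df with Diet, Exercise, and WeightLoss columns:

# Mean weight loss by Diet, traced separately for each Exercise program;
# non-parallel lines indicate an interaction.
interaction.plot(df$Diet, df$Exercise, df$WeightLoss,
                 xlab = "Diet", ylab = "Mean weight loss",
                 trace.label = "Exercise")

# The Exercise*Diet interaction term itself is tested with:
summary(aov(WeightLoss ~ Diet * Exercise, data = df))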
 Suppose this interaction wasn't on our radar, and we instead focused only on the
individual main effects and their impact on weight loss:
 Based on this plot, we would incorrectly conclude that diet A is better than B. As we saw
from the interaction plot, that is only true IF we’re looking at the cardio group.
 Clearly, we always need to evaluate interactions when analyzing multiple factors. If you
don't, you run the risk of drawing incorrect conclusions...and you might just get ketchup
with your sushi roll.
 ANOVA can also be used as a means to compare two linear regression models using the
Chi-square measure.
 Here are two regression models we want to compare to each other. The order here is
important, so make sure you specify the models in the correct order.
 Model 1: y = a
 Model 2: y = b
 The p-value of the test is 0.82. This means that the fitted Model 1 is not significantly
different from Model 2 at the α = 0.05 level. Note that this test makes sense only if
Model 1 and Model 2 are nested models (i.e., it tests whether the reduction in the residual
sum of squares is statistically significant or not).
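A minimal sketch of this comparison in R, assuming two nested fitted models model1 and model2:

# Compare nested models: tests whether the reduction in the residual
# sum of squares is statistically significant.
anova(model1, model2, test = "Chisq")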
 Linear regression is used to analyze
continuous relationships; however,
regression is essentially the same as
ANOVA.
 In ANOVA, we calculate means and
deviations of our data from the means.
 In linear regression, we calculate the best
line through the data and calculate the
deviations of the data from this line.
 The F ratio can be calculated in both.
Data Science - Part IV - Regression Analysis & ANOVA
 The dynamics and rapid change of the real estate
market require business decision makers to seek
advanced analytical solutions to maintain a
competitive edge.
 Real estate pricing and home valuation can
greatly benefit from predictive modeling
techniques, in particular, linear regression.
 The dataset we will be working with reviews home
values in Boston, Massachusetts and compiles a
number of other statistics to help aid in
determining property value.
 The goal for this exercise will be to provide a
predictive model that can be leveraged to help
real estate businesses in the Boston market.
 Here is a description of the variables within the dataset:
 Our goal is to develop a multiple linear regression model for the median value of a home in
Boston based upon the other variables.
 First, let's take a look at the raw data in the table.
 With so many potential independent variables, we first need to reduce the field of variables
to those which can help explain the model.
 A review of the correlation matrix indicates
that there are a number of variables which we
can use when building a model.
 Based upon the correlations, we will initially
focus on utilizing the following variables:
indus, rm, tax, ptratio, and lstat.
 Afterwards, we will assess the quality of the
model's performance and utilize an alternative
modeling approach.
 A cursory review of the model's output shows a
couple of potential issues with the model.
 Despite having a correlation of -0.484 to the
median value, the indus variable is not statistically
significant (0.2802) and should be dropped from
the model. The tax variable is also statistically
insignificant and should be removed from the
model.
 The Adjusted R-squared is 0.6772, which indicates
a reasonable goodness of fit and 67.7% of the
variation in house prices can be explained by the
five variables.
 Some would argue, based on industry
experience, that we could leave the
model as is because it performs reasonably
well. Nevertheless, we will rebuild this model and
improve its performance.
 Let's double check that the dependent variable is normally distributed. It appears that the dataset
is right skewed and could benefit from a log transformation.
[Figure: distribution of the dependent variable before and after the log transformation]
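A minimal R sketch of this check, assuming the Boston data are in a data frame named boston with the median value in a column named medv:

# Distribution of the dependent variable before and after the log transform.
hist(boston$medv, main = "medv")
hist(log(boston$medv), main = "log(medv)")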
 Let's utilize an automated feature
selection procedure called stepwise
selection to identify those variables
which are both statistically significant and
add value to the regression model.
 This revised model now has all variables
showing statistical significance at the
p<0.05 level.
 Additionally, the model now has an
Adjusted R-Square of 73.4% compared
to 67.7% which is a notable
improvement.
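A minimal sketch of stepwise selection in R, assuming the same hypothetical boston data frame and the log-transformed response:

# Fit the full model, then let step() add and drop terms by AIC.
full <- lm(log(medv) ~ ., data = boston)
reduced <- step(full, direction = "both")
summary(reduced)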
 We checked the VIF to determine
whether multicollinearity is an issue.
All of the values are below 3 which
indicates that this is not an issue.
 A review of the QQ Plot indicates that
the data generally agree with a
normal distribution; however, there are
longer tails at the ends of the
distribution.
 A review of the residual plots indicates
the potential need to apply some
transformation of the independent
variable to further improve the model.
[Figures: QQ plot and residual plot]
 Because we performed a log-level
transformation of the data, we must now
interpret a unit change in x as a
constant percentage change in y.
Interpretation:
 Therefore, each additional room a
house has leads to an increase of the
price by 12%, holding other variables
constant.
 If the home is near the river, the price
increases by 13%, holding other
variables constant.
 For each additional unit of distance from
the main employment centers, the price
decreases by 3%, holding other variables
constant.
Here is the final model that we produced:
Data Science - Part IV - Regression Analysis & ANOVA
 A supermarket is selling a new type of grape juice
in some of its stores for pilot testing.
 The senior management wants to understand the
new grape juice's impact on apple
juice sales, cookie sales, and
profitability.
 We will showcase how it is possible to build on
linear OLS regression models and econometric
methodologies to solve a series of advanced
business problems.
 The goal will be to provide tangible
recommendations from our analyses to help this
business manage their portfolio.
Our goal is to set up an experiment to analyze:
 Which type of in-store advertisement is more effective? The marketing group has placed two
types of ads in stores for testing: one theme is the natural production of the juice, the other
theme is family health caring.
 The Price Elasticity – the reactions of sales quantity of the grape juice to its price change.
 The Cross-Price Elasticity – the reactions of sales quantity of the grape juice to the price
changes of other products such as apple juice and cookies in the same store.
 How do we find the unit price of the grape juice that maximizes profit, and what is the
sales forecast at that price?
 Here is a description of the variables within the dataset:
 First, let's take a look at the raw data in the table.
 From the summary table, we can get a rough sense of the basic statistics of each numeric
variable. For example, the mean value of sales is 216.7 units, the min value is 131, and the
max value is 335.
 We can further explore the distribution of the sales data by visualizing it in graphical
form. We don't find outliers in the box plot, and the sales distribution is roughly
normal, so it is not necessary to apply further data cleaning and treatment to the
data set.
 The marketing team wants to find out which of
the two ad types is more effective for sales:
 a natural production theme
 a family health caring theme.
 To find out the better ad, we can calculate and
compare the mean of sales with the two
different ad types at the first step.
 The mean of sales with the natural production
theme is about 187; the mean of sales with the
family health caring theme is about 247.
 It looks like the latter one is better.
 To find out how likely this conclusion is
correct for the whole population, it is
necessary to do statistical testing – a two-
sample t-test.
 Both datasets appear to be
normally distributed, and to be certain we
can run the Shapiro-Wilk test.
 The p-values of the Shapiro-Wilk tests
are larger than 0.05, so there is no strong
evidence to reject the null hypothesis
that the two groups of sales data are
normally distributed.
 Now we can conduct the Welch two-sample t-test since the t-test assumptions are met.
From the output of the t-test above, we can say that:
 We have strong evidence to say that the population means of the sales with the two
different ad types are different because the p-value of the t-test is very small;
 With 95% confidence, we can estimate that the mean of the sales with the natural
production theme ad is somewhere between 27 and 93 units less than that of the sales
with the family health caring theme ad.
 So the conclusion is that the ad with the theme of family health caring is BETTER.
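A minimal R sketch of these two steps, assuming the sales for each ad type are in vectors sales_natural and sales_family:

# Normality check for each group (H0: the data are normally distributed).
shapiro.test(sales_natural)
shapiro.test(sales_family)

# Welch two-sample t-test (unequal variances is the default in R).
t.test(sales_natural, sales_family)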
 With the information given in the data set,
we can explore how grape juice price, ad
type, apple juice price, cookies price
influence the sales of grape juice in a store
by multiple linear regression analysis.
 Here, “sales” is the dependent variable and
the others are independent variables.
 Let’s investigate the correlation between
the sales and other variables by displaying
the correlation coefficients in pairs.
 The correlation coefficients between sales
and price, ad type, price apple, and price
cookies are 0.85, 0.58, 0.37, and 0.37
respectively, which means they all might
have some influence on the sales.
 We can try to add all of the independent
variables into the regression model:
 The p-values for Price, Ad Type, and Price Cookies
are much less than 0.05. They are significant in
explaining the sales, and we are confident about including
these variables in the model.
 The p-value of Price Apple is a bit larger than
0.05; there seems to be no strong evidence that
apple juice price explains the sales. However,
according to our real-life experience, we know that
when the apple juice price is lower, consumers are likely
to buy more apple juice, and then the sales of
other fruit juices will decrease.
 So we can also add it into the model to explain
the grape juice sales.
 The Adjusted R-squared is 0.881, which indicates
a reasonable goodness of fit and 88% of the
variation in sales can be explained by the four
variables.
 The assumptions for the regression to
be valid are that the data are random and
independent and that the residuals are normally
distributed with constant variance.
Let's check the residual assumptions
visually.
 The Residuals vs Fitted graph above
shows that the residuals scatter around
the fitted line with no obvious pattern, and
the Normal Q-Q graph shows that
basically the residuals are normally
distributed. The assumptions are met.
 The VIF test value for each variable is
close to 1, which means the
multicollinearity is very low among these
variables.
 With the model established, we can analyze the Price Elasticity (PE) and Cross-Price
Elasticity (CPE) to predict the reactions of sales quantity to price changes.
Price Elasticity
 PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3
 P is price, Q is sales quantity
 ΔQ/ΔP = -51.24 , the parameter before the variable “price” in the above model
 P/Q = 9.738 / 216.7 = 0.045
 P is the mean of prices in the dataset, Q is the mean of the Sales variable.
Interpretation: The PE indicates that 10% decrease in grape juice price will increase the
grape juice sales by 23%, and vice versa.
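The same arithmetic in R, using the fitted price coefficient and the sample means:

# Price elasticity = slope * (mean price / mean sales).
pe <- -51.24 * (9.738 / 216.7)
pe   # about -2.3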
Let’s further calculate the CPE on apple juice and cookies to analyze the how the change
of apple juice price and cookies price influence the sales of grape juice.
Cross Price Elasticity
 CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * (7.659 / 216.7) = 0.78
 CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * (9.622 / 216.7) = -1.12
Interpretation:
 The CPEapple indicates that a 10% decrease in apple juice price will DECREASE the sales of
grape juice by 7.8%, and vice versa. So grape juice and apple juice are substitutes.
 The CPEcookies indicates that a 10% decrease in cookies price will INCREASE the grape juice
sales by 11.2%, and vice versa. So grape juice and cookies are complements. Placing
the two products together will likely increase the sales of both.
 We can also see that grape juice sales increase by 29.74 units when using the ad with
the family health caring theme (ad_type = 1).
 Usually companies want to get higher profit
rather than just higher sales quantity.
 So, how do we set the optimal price for the new
grape juice to maximize profit, based on
the dataset collected in the pilot period and our
regression model?
 To simplify the question, we can let the Ad Type
= 1, the Price Apple = 7.659 (mean value), and
the Price Cookies = 9.738 (mean value).
The model is simplified as follows:
 Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 *
7.659 – 25.28 * 9.738
 Sales = 772.64 – 51.24*price
 Assume the marginal cost (C) per unit of grape juice is $5.00. We can calculate the profit
(Y) by the following formula:
 Y = (price – C) * Sales Quantity = (price – 5) * (772.64 – 51.24 * price)
 Y = –51.24 * price² + 1028.84 * price – 3863.2
 To get the optimal price to maximize Y, we can use the following R function.
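The original call is not shown here; a minimal sketch of what it might look like using R's optimize():

# Profit as a function of price, from the formula above.
profit <- function(price) -51.24 * price^2 + 1028.84 * price - 3863.2

# Search a plausible price range for the maximum.
optimize(profit, interval = c(5, 20), maximum = TRUE)
# $maximum is about 10.04 and $objective about 1301.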
 The optimal price is $10.04; the maximum profit will be $1301 according to the above
output. In reality, we can reasonably set the price to be $10.00 or $9.99.
 We can further use the model to predict the sales when the price is $10.00.
 Additionally, the ad type = 1
 Mean Price Apple = 7.659
 Mean Price Cookies = 9.738
Sales = 774.81 – 51.24 * Price + 29.74 * Ad Type + 22.08 * Price Apple – 25.27 * Price Cookies
Sales = 774.81 – 51.24 * (10) + 29.74 * (1)+ 22.08 * (7.659) – 25.27 *(9.738)
 The sales forecast is 215 units, with a range of 176 to 254 at 95% confidence, for one
store in one week on average.
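A minimal sketch of producing this forecast in R, assuming the fitted model object is named model and the column names match the dataset:

new_store <- data.frame(price = 10, ad_type = 1,
                        price_apple = 7.659, price_cookies = 9.738)

# 95% prediction interval for sales at these settings.
predict(model, newdata = new_store, interval = "prediction", level = 0.95)
# fit is about 215, with lwr about 176 and upr about 254.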
 Based on the forecast and other factors, the supermarket can prepare the inventory for all
of its stores after the pilot period.
 Reside in Wayne, Illinois
 Active Semi-Professional Classical Musician
(Bassoon).
 Married my wife on 10/10/10 and we have been
together for 10 years.
 Pet Yorkshire Terrier / Toy Poodle named
Brunzie.
 Pet Maine Coons named Maximus Power and
Nemesis Gul du Cat.
 Enjoy Cooking, Hiking, Cycling, Kayaking, and
Astronomy.
 Self-proclaimed Data Nerd and Technology
Lover.
Data Science - Part IV - Regression Analysis & ANOVA
 http://en.wikipedia.org/wiki/Regression_analysis
 http://www.ftpress.com/articles/article.aspx?p=2133374
 http://people.duke.edu/~rnau/Notes_on_linear_regression_analysis--Robert_Nau.pdf
 http://www.theanalysisfactor.com/interpreting-interactions-in-regression/
 http://www.edanzediting.com/blog/statistics_anova_explained#.VIdeEo0tBSM
 http://en.wikipedia.org/wiki/Ordinary_least_squares
 http://www.unt.edu/rss/class/mike/6810/OLS%20Regression%20Summary.pdf
 http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
 http://www.chsbs.cmich.edu/fattah/courses/empirical/multicollinearity.html
 http://home.wlu.edu/~gusej/econ398/notes/logRegressions.pdf
 http://www.dataapple.net/?p=19

AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
Miguel Ángel Rodríguez Anticona
 
2024 June - Orange County (CA) Tableau User Group Meeting
2024 June - Orange County (CA) Tableau User Group Meeting2024 June - Orange County (CA) Tableau User Group Meeting
2024 June - Orange County (CA) Tableau User Group Meeting
Alison Pitt
 
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
Nikita Singh$A17
 
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
seenu pandey
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
punebabes1
 
buku report tentang analisis TIMSS 2023.pdf
buku report tentang analisis TIMSS 2023.pdfbuku report tentang analisis TIMSS 2023.pdf
buku report tentang analisis TIMSS 2023.pdf
ABDULKALAM847167
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 

Recently uploaded (20)

Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any TimeBangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
 
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
MRP2 hshsbsbenne.pdfdbbdbsbebenebeneneebbe
MRP2 hshsbsbenne.pdfdbbdbsbebenebeneneebbeMRP2 hshsbsbenne.pdfdbbdbsbebenebeneneebbe
MRP2 hshsbsbenne.pdfdbbdbsbebenebeneneebbe
 
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
 
@Call @Girls in Bangalore 🚒 0000000000 🚒 Tanu Sharma Best High Class Bangalor...
@Call @Girls in Bangalore 🚒 0000000000 🚒 Tanu Sharma Best High Class Bangalor...@Call @Girls in Bangalore 🚒 0000000000 🚒 Tanu Sharma Best High Class Bangalor...
@Call @Girls in Bangalore 🚒 0000000000 🚒 Tanu Sharma Best High Class Bangalor...
 
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
 
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docxNEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
 
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdfAWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
 
2024 June - Orange County (CA) Tableau User Group Meeting
2024 June - Orange County (CA) Tableau User Group Meeting2024 June - Orange County (CA) Tableau User Group Meeting
2024 June - Orange County (CA) Tableau User Group Meeting
 
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
 
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
 
buku report tentang analisis TIMSS 2023.pdf
buku report tentang analisis TIMSS 2023.pdfbuku report tentang analisis TIMSS 2023.pdf
buku report tentang analisis TIMSS 2023.pdf
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 

Data Science - Part IV - Regression Analysis & ANOVA

  • 8. Simple Linear Regression Formula  The output of a regression analysis will produce a coefficient table similar to the one below.  This table shows that the intercept is -114.326 and the Height coefficient is 106.505 +/- 11.55.  This can be interpreted as: for each unit increase in X, we can expect that Y will increase by 106.5.  Also, the t value and Pr > |t| indicate that these variables are statistically significant at the 0.05 level and can be included in the model.
  • 9. Multiple Linear Regression Formula  A multiple linear regression is essentially the same as a simple linear regression except that there can be multiple coefficients and independent variables. Y = β₀ + β₁X₁ + β₂X₂ + … + ε  The interpretation of a coefficient is slightly different than in a simple linear regression. Using the table below, it can be thought of as: each 1 unit increase in Width increases Y by 94.56, holding all other variables constant.
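As a minimal sketch of how such a model is fit in R (the data frame and column names here are hypothetical, simulated only to make the snippet self-contained):

    # Simulated stand-in data: y depends on width and height
    set.seed(1)
    df <- data.frame(width = runif(50, 1, 10), height = runif(50, 1, 10))
    df$y <- 5 + 94.56 * df$width + 10 * df$height + rnorm(50, sd = 20)

    fit <- lm(y ~ width + height, data = df)  # multiple linear regression
    summary(fit)  # coefficient table: estimates, std. errors, t values, Pr(>|t|)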
  • 10. What is Ordinary Least Squares or OLS?  In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model.  The goal of OLS is to minimize the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data.
  • 11.  Visually this is seen as the sum of the vertical distances between each data point in the set and the corresponding point on the regression line.  The smaller the differences (square size), the better the model fits the data.
  • 12. There are a number of classical assumptions which must hold true if we are to trust the outcome of the regression model.  The sample is representative of the population for the inference prediction.  The error is a random variable with a mean of zero conditional on the independent variables.  The independent variables are measured with no error.  The predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others.  The errors are uncorrelated, that is, the variance–covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.  The variance of the error is constant across observations (homoscedasticity).
  • 13. Consequences of using an invalid modeling procedure include:  The consequences have a tremendous impact on the theory that formed the basis of the investigation.  A lack of linear association between independent and dependent variables, model misspecification, etc. suggests that you have the wrong theory.  Biased, inefficient coefficients due to poor reliability, collinearity, etc. lead to an incorrect interpretation of your theory.  Outliers imply that you cannot apply your theory to the entire population from which you drew your data.  Overfitting implies that you are overconfident in your theory.
  • 14. There are a number of statistics and diagnostic tests we can draw from to evaluate linear regression models beyond EDA.  Coefficient of Determination  Residual Plot  Breusch-Pagan or White Test  Variance Inflation Factor  Influential Observations  Leverage Points  Cook’s Distance  Etc…
  • 15. R²: Coefficient of Determination  This is a measure of the goodness of fit for a linear regression model.  It is the percentage of the variation in the dependent variable that is explained by the linear model: R² = Explained variation / Total variation  R² is always between 0 and 100%:  0% indicates that the model explains none of the variability of the dependent data around its mean.  100% indicates that the model explains all the variability of the dependent data around its mean.
  • 16. Are Low R² Values Inherently Bad?  No! There are two major reasons why it can be just fine to have low R² values.  In some fields, it is entirely expected that your R² values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R² values lower than 50%. Humans are simply harder to predict than, say, physical processes.  Furthermore, if your R² value is low but you have statistically significant predictors, you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value.  Regardless of the R², the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant. Obviously, this type of information can be extremely valuable.
  • 17.  Adding independent variables to your model will increase the value of R² regardless of whether the variables offer any increase in explanatory power. To combat this issue, we should focus on the adjusted R² metric, which penalizes a model for having too many variables.  There is no generally accepted technique for relating the number of total observations to the number of independent variables in a model. One possible rule of thumb, suggested by Good and Hardin, is: n = log(N) / log(m)  Where N is the sample size, n is the number of independent variables, and m is the number of observations needed to reach the desired precision if the model had only one variable.  For example, if the dataset contained 1000 observations and the researcher decided that 5 observations are needed to support a single variable, then the maximum number of independent variables the model can support is log(1000)/log(5) ≈ 4.3, i.e. 4.
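The rule is easy to check in R (the numbers are the ones from the example above):

    # Good & Hardin rule of thumb: n = log(N) / log(m)
    N <- 1000  # total observations
    m <- 5     # observations needed per variable (the analyst's choice)
    floor(log(N) / log(m))  # maximum number of predictors; here 4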
  • 18. Key Limitations of R²  R² cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.  R² does not indicate whether a regression model is adequate. You can have a low R² value for a good model, or a high R² value for a model that does not fit the data!  The R² in your output is a biased estimate of the population R².
  • 19.  A residual plot is a scatterplot of the residuals (the differences between the actual and predicted values) against the predicted values.  A proper model will exhibit a random pattern for the spread of the residuals with no discernible shape.  Residual plots are used extensively in linear regression analysis for diagnostics and assumption testing.  If the residuals form a curved shape, then we know that a transformation will be necessary and can explore methods like the Box-Cox transformation. (Figure: random residuals vs. curved residuals.)
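Continuing the hypothetical fit from the earlier snippet, a residual plot takes a few lines of base R:

    # Residuals vs. predicted values; a healthy model scatters randomly around 0
    plot(fitted(fit), resid(fit),
         xlab = "Predicted value", ylab = "Residual", main = "Residual plot")
    abline(h = 0, lty = 2)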
  • 20.  Linear Regression Analysis using OLS contains an assumption that residuals are identically distributed across every X variable.  When this condition holds, the error terms are homoskedastic, which means the errors have the same scatter regardless of the value of X.  When the scatter of the errors is different, varying depending on the value of one or more of the independent variables, the error terms are heteroskedastic.  A review of a scatterplot of the studentized residuals against the dependent variable can be used to detect if heteroskedasticity is present. The residuals will appear to fan outward in a distinct pattern. Heteroskedasticity Pattern
  • 21.  Heteroskedasticity has serious consequences for the OLS estimator. Although the OLS estimator remains unbiased, the estimated standard errors are wrong. Because of this, confidence intervals and hypothesis tests cannot be relied on.  The Breusch-Pagan test (alt. White Test) is a method that can be employed to identify whether the error variances are all equal versus the alternative that the error variances are a multiplicative function of one or more variables.  The results of this test show that the Chi-Square value is fairly low, indicating that heteroskedasticity is probably not a problem. Techniques to correct heteroskedasticity:  Re-specify the model. (Include omitted variables.)  Transform the variables.  Use Weighted Least Squares in place of OLS.
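A sketch of the test in R, assuming the lmtest package is installed (reusing the hypothetical fit from above):

    library(lmtest)  # provides bptest()
    bptest(fit)      # H0: constant error variance (homoskedasticity)
    # A small p-value is evidence of heteroskedasticity; consider transforming
    # variables or switching to weighted least squares.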
  • 22. What is Multicollinearity?  Collinearity (or multicollinearity) is the undesirable situation where the correlations among the independent variables are strong.  In some cases, multiple regression results may seem paradoxical. For instance, the model may fit the data well (high F-Test), even though none of the X variables has a statistically significant impact on explaining Y.  How is this possible? When two X variables are highly correlated, they both convey essentially the same information. When this happens, the X variables are collinear and the results show multicollinearity.
  • 23. Why is Multicollinearity a Problem?  Multicollinearity misleadingly inflates the standard errors of the coefficients.  Thus, it makes some variables statistically insignificant while they should be otherwise significant.  It is like two or more people singing loudly at the same time. One cannot discern which is which. They offset each other.
  • 24. How to detect Multicollinearity?  Formally, variance inflation factors (VIF) measure how much the variance of each estimated coefficient is increased over the case of no correlation among the X variables. If no two X variables are correlated, then all the VIFs will be 1.  If the VIF for one of the variables is around or greater than 5, there is collinearity associated with that variable.  The easy solution: if two or more variables have a VIF around or greater than 5, one of them must be removed from the regression model. To determine the best one to remove, try removing each one individually and select the regression equation that explains the most variance (the highest R²). VIF > 5: collinearity present.
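In R, VIFs are one call away, assuming the car package (again reusing the hypothetical fit):

    library(car)  # provides vif()
    vif(fit)      # values around or above 5 flag problematic collinearity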
  • 25.  Cook’s distance or Cook’s D is a commonly used estimate of the influence of a data point when performing OLS regression.  A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation.  The Bonferroni method is a simple method that allows many comparison statements to be made (or confidence intervals to be constructed) while still assuring that an overall confidence coefficient is maintained.  The hat values (leverages) are the diagonal elements of the hat matrix; they measure how far an observation's predictor values lie from those of the other observations.
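All of these diagnostics are built into base R or the car package; a quick sketch on the hypothetical fit:

    cooks.distance(fit)  # influence of each observation on the fitted model
    rstudent(fit)        # studentized residuals
    hatvalues(fit)       # leverage (hat) values
    library(car)
    outlierTest(fit)     # Bonferroni-adjusted test of the largest residual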
  • 26.  Adding interaction terms to a regression model can greatly expand understanding of the relationships among the variables in the model and allows more hypotheses to be tested. Height = 42 + 2.3 * Bacteria + 11 * Sun  Height = the height of a shrub (cm)  Bacteria = the amount of bacteria in the soil (1000 per ml)  Sun = whether the shrub is located in partial or full sun (Sun = 0 is partial and Sun = 1 is full)
  • 27.  It would be useful to add an interaction term to the model if we wanted to test the hypothesis that the relationship between the amount of bacteria in the soil on the height of the shrub was different in full sun than in partial sun.  One possibility is that in full sun, plants with more bacteria in the soil tend to be taller, whereas in partial sun, plants with more bacteria in the soil are shorter.  Another possibility is that plants with more bacteria in the soil tend to be taller in both full and partial sun, but that the relationship is much more dramatic in full than in partial sun.
  • 28.  The presence of a significant interaction indicates that the effect of one predictor variable on the response variable is different at different values of the other predictor variable.  It is tested by adding a term to the model in which the two predictor variables are multiplied. The regression equation will look like this: Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
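A sketch of fitting such a model in R, on simulated shrub data whose true coefficients are the ones used on the next slides (35, 4.2, 9, and 3.2); the data themselves are hypothetical:

    set.seed(2)
    shrubs <- data.frame(Bacteria = runif(60, 0, 10), Sun = rep(c(0, 1), each = 30))
    shrubs$Height <- 35 + 4.2 * shrubs$Bacteria + 9 * shrubs$Sun +
                     3.2 * shrubs$Bacteria * shrubs$Sun + rnorm(60, sd = 3)

    # 'Bacteria * Sun' expands to both main effects plus their interaction
    m <- lm(Height ~ Bacteria * Sun, data = shrubs)
    summary(m)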
  • 29.  Adding an interaction term to a model drastically changes the interpretation of all of the coefficients.  If there were no interaction term, B1 would be interpreted as the unique effect of Bacteria on Height. But the interaction means that the effect of Bacteria on Height is different for different values of Sun.  So the unique effect of Bacteria on Height is not limited to B1, but also depends on the values of B3 and Sun.  The unique effect of Bacteria is represented by everything that is multiplied by Bacteria in the model: B1 + B3*Sun. B1 is now interpreted as the unique effect of Bacteria on Height only when Sun = 0. Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
  • 30.  In our example, once we add the interaction term, our model looks like:  Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun  Adding the interaction term changed the values of B1 and B2.  The effect of Bacteria on Height is now 4.2 + 3.2*Sun. For plants in partial sun, Sun = 0, so the effect of Bacteria is 4.2 + 3.2 * 0 = 4.2.  So for two plants in partial sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 4.2 cm taller than a plant with less bacteria.
  • 31.  For plants in full sun, however, the effect of Bacteria is 4.2 + 3.2*1 = 7.4.  So for two plants in full sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 7.4 cm taller than a plant with less bacteria.  Because of the interaction, the effect of having more bacteria in the soil is different if a plant is in full or partial sun.  Another way of saying this is that the slopes of the regression lines between height and bacteria count are different for the different categories of sun. B3 indicates how different those slopes are. Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
  • 32.  Interpreting B2 is more difficult.  B2 is the effect of Sun when Bacteria = 0. Since Bacteria is a continuous variable, it is unlikely that it equals 0 often, if ever, so B2 can be virtually meaningless by itself.  Instead, it is more useful to understand the effect of Sun, but again, this can be difficult.  The effect of Sun is B2 + B3*Bacteria, which is different at every one of the infinite values of Bacteria.  For that reason, often the only way to get an intuitive understanding of the effect of Sun is to plug a few values of Bacteria into the equation to see how Height, the response variable, changes. Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
  • 33.  The presentation so far has only considered the following form of a linear regression equation: y = β₀ + β₁x₁ + ε  This is considered a “level-level” specification because the raw values of y are being regressed against the raw values of x. How do we interpret β₁?  We differentiate with respect to x₁ to find the marginal effect of x on y: dy/dx₁ = β₁. In this case, β₁ itself is the marginal effect.
  • 34.  A log-level regression specification: ln(y) = β₀ + β₁x₁ + ε  This is called a “log-level” specification because the natural-log-transformed values of y are being regressed on the raw values of x.  You might want to run this specification if you think that unit increases in x lead to a constant percentage increase in y.  Ex. Wage on education? Forest lumber volume on years?
  • 35. How do we interpret β₁?  First solve for y: y = e^(β₀ + β₁x₁ + ε)  Then differentiate to get the marginal effect: dy/dx₁ = β₁·y  So the marginal effect depends on the value of y, and β₁ itself represents the growth rate.  For example, if we estimate that β₁ is 0.04, we would say that each additional year increases the volume of lumber by about 4%.
  • 36.  A log-log regression specification: ln(y) = β₀ + β₁·ln(x₁) + ε  This is called a “log-log” specification because the natural-log-transformed values of y are being regressed on the log-transformed values of x.  You might want to run this specification if you think that percentage increases in x lead to a constant percentage change in y. Ex. constant demand elasticity.  To calculate marginal effects, first solve for y: y = e^(β₀)·x₁^(β₁)·e^(ε)  Then differentiate with respect to x₁: dy/dx₁ = β₁·y/x₁
  • 37.  From the previous slide the marginal effect is: dy/dx₁ = β₁·y/x₁  Solving for β₁ we get: β₁ = (dy/y) / (dx₁/x₁)  This makes β₁ an elasticity. If x₁ is a price and y is a demand and we estimate β₁ = -0.6, it means that a 1% increase in the price of a good would lead to a 0.6% decrease in demand.
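A quick simulated sketch of recovering an elasticity with a log-log fit (the data are hypothetical, generated with a true elasticity of -0.6 to match the example):

    set.seed(3)
    price  <- runif(100, 1, 20)
    demand <- exp(4 - 0.6 * log(price) + rnorm(100, sd = 0.1))

    m_ll <- lm(log(demand) ~ log(price))
    coef(m_ll)["log(price)"]  # estimated elasticity, close to -0.6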
  • 38.  Analysis of variance, or ANOVA, is used to compare differences of means among more than 2 groups.  It does this by looking at variation in the data and where that variation is found (hence its name).  Specifically, ANOVA compares the amount of variation between groups with the amount of variation within groups.  It can be used for both observational and experimental studies.
  • 39.  When we take samples from a population, we expect each sample mean to differ simply because we are taking a sample rather than measuring the whole population; this is called sampling error but is often referred to more informally as the effects of “chance”.  Thus, we always expect there to be some differences in means among different groups.  The question is: is the difference among groups greater than that expected to be caused by chance? In other words, is there likely to be a true (real) difference in the population mean?
  • 40. The ANOVA model  Mathematically, ANOVA can be written as: xᵢⱼ = μᵢ + εᵢⱼ  where x are the individual data points (i and j denote the group and the individual observation), ε is the unexplained variation and the parameters of the model (μ) are the population means of each group. Thus, each data point (xᵢⱼ) is its group mean plus error. Assumptions of ANOVA:  The response is normally distributed  Variance is similar within the different groups  The data points are independent
  • 41. Hypothesis testing  Like other classical statistical tests, we use ANOVA to calculate a test statistic (the F-ratio) with which we can obtain the probability (the P-value) of obtaining the data assuming the null hypothesis.  Null hypothesis: all population means are equal.  Alternative hypothesis: at least one population mean is different from the rest.  A significant P-value (usually taken as P<0.05) suggests that at least one group mean is significantly different from the others. In other words, a variable with p<0.05 allows us to consider including the variable in a predictive model.  ANOVA separates the variation in the dataset into 2 parts: between-group and within-group. These variations are called the sums of squares, which can be seen in the following slides.
  • 42. Calculation of the F ratio Step 1) Variation between groups  The between-group variation (or between-group sums of squares, SS) is calculated by comparing the mean of each group with the overall mean of the data.  Specifically, this is: Between SS = n₁(x̄₁ − x̄)² + n₂(x̄₂ − x̄)² + n₃(x̄₃ − x̄)²  We then divide the Between SS by its degrees of freedom (the number of groups minus 1, because the group deviations from the overall mean must sum to zero, so once all but one are known, the last is also known) to get our estimate of the mean variation between groups.
  • 43. Step 2) Variation within groups  The within-group variation (or the within-group sums of squares) is the variation of each observation from its group mean: Within SS = s₁²(n₁ − 1) + s₂²(n₂ − 1) + s₃²(n₃ − 1)  i.e., we add up the variance of each group multiplied by its degrees of freedom. Note, you might also come across the total SS (the sum of squared deviations of every observation from the overall mean); Within SS is then Total SS minus Between SS.  As before, we then divide by the within-group degrees of freedom (total observations minus the number of groups) to get the mean variation within groups.
  • 44. Step 3) The F ratio  The F ratio is then calculated as: F ratio = Mean Between-Group SS / Mean Within-Group SS  If the average difference between groups is similar to that within groups, the F ratio is about 1. As the average difference between groups becomes greater than that within groups, the F ratio becomes larger than 1.  Therefore, variables with higher F ratio values provide greater explanatory power when utilized in predictive models.  To obtain a P-value, the statistic is tested against the F-distribution with the degrees of freedom associated with the numerator and denominator of the ratio. The P-value is the probability of getting that F ratio or a greater one. Larger F ratios give smaller P-values.
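A one-way ANOVA sketch in R on simulated data (three hypothetical groups with different means):

    set.seed(4)
    dat <- data.frame(group = rep(c("A", "B", "C"), each = 20),
                      y = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 15)))

    summary(aov(y ~ group, data = dat))  # reports the F ratio and its p-value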
  • 45.  Do you prefer ketchup or soy sauce?  If someone asked you this question, your answer would likely depend upon what you were eating. You probably wouldn't dunk your spicy tuna roll in ketchup. And most people (pregnant moms-to-be excluded) don't seem to fancy eating soy sauce with hot French fries. OR
  • 46. A Common Error When Using ANOVA to Assess Variables  So you collect data about your variables of interest, and now you're ready to do your analysis. This is where many people make the unfortunate mistake of looking only at each variable individually.  In addition to considering how each variable impacts your response variable, you also need to evaluate the interactions between those variables and determine if any of those are significant as well.  And much like your preference for ketchup versus soy sauce depends upon what you’re eating, the optimum setting for a given variable will depend upon the setting of another variable when an interaction is present.
  • 47. How to Evaluate and Interpret an Interaction  Let’s use a weight loss example to illustrate how we can evaluate an interaction between factors. We're evaluating 2 different diets and 2 different exercise programs: one focused on cardio and one focused on weight training. We want to determine which combination results in the greater weight loss. We randomly assign participants to either diet A or B and either the cardio or weight training regimen, and then record the amount of weight they’ve lost after 1 month.  Here is a snapshot of the data:
  • 48.  Example: we want to understand how well the Diet variable explains the WeightLoss variable. OR Observations:  The F value is well over 1, indicating that this variable has some explanatory value for WeightLoss.  The p-value is statistically significant at the 0.05 level.
  • 49.  Let’s look at the ANOVA output for both the Diet and Exercise variables. Observations:  The Diet variable has an F value of 13.69  The Exercise variable has an F value of 6.355  Both variables are statistically significant at the 0.05 level Within Group
  • 50.  We can see that the p-value for the Exercise*Diet interaction is 0.000. Because this p-value is so small, we can conclude that there is indeed a significant interaction between Exercise and Diet.  So which diet is better? Our data suggest it’s like asking “ketchup or soy sauce?” The answer is, "It depends."
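A sketch of the two-way ANOVA in R; the weight-loss data here are simulated stand-ins (the real study data are not shown), built so that the best diet depends on the exercise program:

    set.seed(5)
    wl <- expand.grid(Diet = c("A", "B"), Exercise = c("Cardio", "Weights"), rep = 1:15)
    wl$WeightLoss <- 4 + (wl$Diet == "A") * (wl$Exercise == "Cardio") * 2 +
                     (wl$Diet == "B") * (wl$Exercise == "Weights") * 2 + rnorm(nrow(wl))

    # 'Diet * Exercise' fits both main effects and their interaction
    summary(aov(WeightLoss ~ Diet * Exercise, data = wl))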
  • 51.  Since the Exercise*Diet interaction is significant, let’s use an interaction plot to take a closer look:  For participants using the cardio program (shown in black), we can see that diet A is best and results in greater weight loss. However, if you’re following the weight training regimen (shown in red), then diet B results in greater weight loss than A.
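Base R can draw the interaction plot directly (continuing the simulated wl data above):

    with(wl, interaction.plot(Exercise, Diet, WeightLoss,
                              ylab = "Mean weight loss", col = c("black", "red")))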
  • 52.  Suppose this interaction wasn't on our radar, and we instead focused only on the individual main effects and their impact on weight loss:  Based on this plot, we would incorrectly conclude that diet A is better than B. As we saw from the interaction plot, that is only true IF we’re looking at the cardio group.  Clearly, we always need to evaluate interactions when analyzing multiple factors. If you don't, you run the risk of drawing incorrect conclusions...and you might just get ketchup with your sushi roll.
  • 53.  ANOVA can also be used as a means to compare two linear regression models using the Chi-square measure.  Here are two regression models we want to compare. The order matters here, so make sure the models are supplied in the correct order.  Model 1: y = a  Model 2: y = b  The p-value of the test is 0.82, meaning that the fitted “Model 1” is not significantly different from Model 2 at the α = 0.05 level. Note that this test makes sense only if Model 1 and Model 2 are nested models (i.e. it tests whether the reduction in the residual sum of squares is statistically significant or not).
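In R this comparison is a single call to anova(); here is a sketch using the two hypothetical models built from the earlier simulated df (the reduced model must be nested in the full one):

    m_small <- lm(y ~ width, data = df)            # reduced model
    m_large <- lm(y ~ width + height, data = df)   # full model
    anova(m_small, m_large, test = "Chisq")        # is the extra term worth keeping?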
  • 54.  Linear regression is used to analyze continuous relationships; however, regression is essentially the same as ANOVA.  In ANOVA, we calculate means and deviations of our data from the means.  In linear regression, we calculate the best line through the data and calculate the deviations of the data from this line.  The F ratio can be calculated in both.
  • 56.  The dynamics and rapid change of the real estate market require business decision makers to seek advanced analytical solutions to maintain a competitive edge.  Real estate pricing and home valuation can greatly benefit from predictive modeling techniques, in particular, linear regression.  The dataset we will be working with reviews home values in Boston, Massachusetts and compiles a number of other statistics to help aid in determining property value.  The goal for this exercise will be to provide a predictive model that can be leveraged to help real estate businesses in the Boston market.
  • 57.  Here is a description of the variables within the dataset:  Our goal is to develop a multiple linear regression model for the median value of a home in Boston based upon the other variables.
  • 58.  First, let's take a look at the raw data in the table.  With so many potential independent variables, we first need to reduce the field of variables to those which add explanatory value to the model.
  • 59.  A review of the correlation matrix indicates that there are a number of variables which we can use when building a model.  Based upon the correlations, we will initially focus on utilizing the following variables: indus, rm, tax, ptratio, and lstat.  Afterwards, we will assess the quality of the model's performance and consider an alternative modeling approach.
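A sketch of the initial fit, assuming the Boston data set that ships with R's MASS package is the version used here:

    library(MASS)  # contains the Boston housing data (medv = median home value)
    fit0 <- lm(medv ~ indus + rm + tax + ptratio + lstat, data = Boston)
    summary(fit0)  # inspect coefficients, p-values, and adjusted R-squared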
  • 60.  A preliminary review of the model's output shows a couple of potential issues.  Despite having a correlation of -0.484 to the median value, the indus variable is not statistically significant (0.2802) and should be dropped from the model. The tax variable is also statistically insignificant and should be removed from the model.  The adjusted R-squared is 0.6772, which indicates a reasonable goodness of fit: 67.7% of the variation in house prices can be explained by the five variables.  There are some who would argue, based on industry experience, that we could leave the model as is because it performs reasonably well. Nevertheless, we will rebuild this model and improve its performance.
  • 61.  Let's double-check that the dependent variable is normally distributed. It appears that the distribution is right skewed and could benefit from a log transformation. Log Transformation
  • 62.  Let's utilize an automated feature selection procedure called stepwise selection to identify those variables which are both statistically significant and add value to the regression model.  This revised model now has all variables showing statistical significance at the p<0.05 level.  Additionally, the model now has an adjusted R-squared of 73.4%, compared to 67.7% previously, which is a notable improvement.
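One way to run the stepwise search in R (a sketch; the exact pool of candidate variables used in the slides is an assumption), continuing with the Boston data loaded above:

    bos <- Boston
    bos$medv <- log(bos$medv)         # log-transform the response
    full <- lm(medv ~ ., data = bos)  # all candidate predictors
    fit1 <- step(full, direction = "both", trace = 0)  # stepwise selection
    summary(fit1)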
  • 63.  We checked the VIF to determine whether multicollinearity is an issue. All of the values are below 3, which indicates that this is not an issue.  A review of the QQ plot indicates that the data generally agree with a normal distribution; however, there are longer tails at the ends of the distribution.  A review of the residual plots indicates the potential need to apply some transformation of the independent variables to further improve the model. QQ Plot Residual Plot
  • 64.  Because we performed a log-level transformation of the data, we must now interpret a unit change in x as a constant percentage change in y. Interpretation:  Each additional room a house has increases the price by 12%, holding other variables constant.  If the home is near the river, the price increases by 13%, holding other variables constant.  For each unit closer to the main employment centers, the price decreases by 3%, holding other variables constant. Here is the final model that we produced:
  • 66.  A supermarket is selling a new type of grape juice in some of its stores for pilot testing.  The senior management wants to understand how the grape juice relates to apple juice sales, cookie sales, and profitability.  We will showcase how it is possible to build on linear OLS regression models and econometric methodologies to solve a series of advanced business problems.  The goal will be to provide tangible recommendations from our analyses to help this business manage its portfolio.
  • 67. Our goal is to set up experiments to analyze:  Which type of in-store advertisement is more effective? The marketing group has placed two types of ads in stores for testing: one theme is the natural production of the juice, the other theme is family health caring.  The price elasticity – the reaction of the sales quantity of grape juice to changes in its own price.  The cross-price elasticity – the reaction of the sales quantity of grape juice to price changes of other products, such as apple juice and cookies, in the same store.  How do we find the unit price of grape juice that maximizes profit, and what is the sales forecast at that price?
  • 68.  Here is a description of the variables within the dataset:  First, let's take a look at the raw data in the table.
  • 69.  From the summary table, we can get a rough sense of the basic statistics of each numeric variable. For example, the mean value of sales is 216.7 units, the minimum is 131, and the maximum is 335.  We can further explore the distribution of sales by visualizing the data in graphical form. We find no outliers in the box plot, and the sales distribution is roughly normal, so no further data cleaning or treatment is necessary.
  • 70.  The marketing team wants to find out which of the two ad types is more effective for sales:  a natural production theme  a family health caring theme.  As a first step, we can calculate and compare the mean sales under the two different ad types.  The mean of sales with the natural production theme is about 187; the mean of sales with the family health caring theme is about 247.  It looks like the latter is better.
  • 71.  To find out how likely the conclusion is to hold for the whole population, it is necessary to run a statistical test – a two-sample t-test.  Both datasets appear normally distributed, and to be certain we can run the Shapiro-Wilk test.  The p-values of the Shapiro-Wilk tests are larger than 0.05, so there is no strong evidence to reject the null hypothesis that the two groups of sales data are normally distributed.
  • 72.  Now we can conduct the Welch two-sample t-test, since the t-test assumptions are met. From the t-test output, we can say that:  We have strong evidence that the population mean sales under the two ad types differ, because the p-value of the t-test is very small;  With 95% confidence, we can estimate that the mean of sales with the natural production theme ad is somewhere between 27 and 93 units less than that with the family health caring theme ad.  So the conclusion is that the ad with the family health caring theme is BETTER.
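The whole testing workflow is a few lines of R; here is a sketch on simulated stand-in data (the real pilot data are not shown; the group means 187 and 247 are taken from the slides):

    set.seed(6)
    sales_nature <- rnorm(30, mean = 187, sd = 30)  # natural production theme
    sales_family <- rnorm(30, mean = 247, sd = 30)  # family health caring theme

    shapiro.test(sales_nature)          # H0: data are normally distributed
    shapiro.test(sales_family)
    t.test(sales_nature, sales_family)  # Welch two-sample t-test by default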
  • 73.  With the information given in the data set, we can explore how grape juice price, ad type, apple juice price, and cookie price influence the sales of grape juice in a store through multiple linear regression analysis.  Here, “sales” is the dependent variable and the others are independent variables.  Let’s investigate the correlation between sales and the other variables by displaying the correlation coefficients in pairs.  The correlation coefficients between sales and price, ad type, price apple, and price cookies are 0.85, 0.58, 0.37, and 0.37 respectively, which means they all might have some influence on sales.
  • 74.  We can try adding all of the independent variables to the regression model:  The p-values for Price, Ad Type, and Price Cookies are much less than 0.05, so these variables are significant in explaining the sales and we are confident in including them in the model.  The p-value for Price Apple is a bit larger than 0.05, so there seems to be no strong evidence that the apple juice price explains the sales. However, from real-life experience we know that when the apple juice price is lower, consumers are likely to buy more apple juice, and the sales of other fruit juices will then decrease.  So we can also add it to the model to explain the grape juice sales.  The adjusted R-squared is 0.881, which indicates a reasonable goodness of fit: 88% of the variation in sales can be explained by the four variables.
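A sketch of the full model in R, on simulated stand-in data generated from the coefficients reported on the next slides (774.81, -51.24, 29.74, 22.1, -25.28); the real pilot data and exact column names are assumptions:

    set.seed(7)
    juice <- data.frame(price = runif(60, 8, 12), ad_type = rep(0:1, 30),
                        price_apple = runif(60, 6, 9), price_cookies = runif(60, 8, 11))
    juice$sales <- 774.81 - 51.24 * juice$price + 29.74 * juice$ad_type +
                   22.1 * juice$price_apple - 25.28 * juice$price_cookies +
                   rnorm(60, sd = 20)

    m_juice <- lm(sales ~ price + ad_type + price_apple + price_cookies, data = juice)
    summary(m_juice)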
  • 75.  The assumptions for the regression to hold are that the data are random and independent, and that the residuals are normally distributed with constant variance. Let’s check the residual assumptions visually.  The Residuals vs Fitted graph shows that the residuals scatter around the fitted line with no obvious pattern, and the Normal Q-Q graph shows that the residuals are approximately normally distributed. The assumptions are met.  The VIF value for each variable is close to 1, which means multicollinearity is very low among these variables.
  • 76.  With the model established, we can analyze the price elasticity (PE) and cross-price elasticity (CPE) to predict the reaction of the sales quantity to price. Price Elasticity  PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3  P is price, Q is sales quantity  ΔQ/ΔP = -51.24, the parameter on the variable “price” in the model above  P/Q = 9.738 / 216.7 = 0.045, where P is the mean price in the dataset and Q is the mean of the Sales variable. Interpretation: the PE indicates that a 10% decrease in the grape juice price will increase grape juice sales by 23%, and vice versa. Linear Regression Model:
  • 77. Let’s further calculate the CPE on apple juice and cookies to analyze how changes in the apple juice price and the cookie price influence the sales of grape juice. Cross-Price Elasticity  CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * (7.659 / 216.7) = 0.78  CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * (9.622 / 216.7) = -1.12 Interpretation:  The CPEapple indicates that a 10% decrease in the apple juice price will DECREASE the sales of grape juice by 7.8%, and vice versa. So grape juice and apple juice are substitutes.  The CPEcookies indicates that a 10% decrease in the cookie price will INCREASE grape juice sales by 11.2%, and vice versa. So grape juice and cookies are complements; placing the two products together will likely increase the sales of both.  We also know that grape juice sales increase by 29.74 units when using the ad with the family health caring theme (ad_type = 1).
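The elasticity arithmetic is easy to reproduce in R (all values taken from the slides):

    b_price <- -51.24; b_apple <- 22.1; b_cookies <- -25.28  # model coefficients
    P_grape <- 9.738; P_apple <- 7.659; P_cookies <- 9.622   # sample mean prices
    Q <- 216.7                                               # mean sales

    b_price   * P_grape   / Q  # own-price elasticity, about -2.3
    b_apple   * P_apple   / Q  # cross-price elasticity vs. apple juice, about 0.78
    b_cookies * P_cookies / Q  # cross-price elasticity vs. cookies, about -1.12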
  • 78.  Usually companies want higher profit rather than just higher sales quantity.  So, how do we set the optimal price for the new grape juice to maximize profit, based on the dataset collected in the pilot period and our regression model?  To simplify the question, we can set Ad Type = 1, Price Apple = 7.659 (mean value), and Price Cookies = 9.738 (mean value). The model then simplifies to:  Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 * 7.659 – 25.28 * 9.738  Sales = 772.64 – 51.24 * price
  • 79.  Assume the marginal cost (C) per unit of grape juice is $5.00. We can calculate the profit (Y) with the following formula:  Y = (price – C) * Sales Quantity = (price – 5) * (772.64 – 51.24 * price)  Y = –51.24 * price² + 1028.84 * price – 3863.2  To find the price that maximizes Y, we can use a numerical optimizer such as R's optimize() function, sketched below.
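A minimal sketch (the interval endpoints are an assumption; any range containing the maximum works):

    # Profit as a function of price, with marginal cost C = $5.00 per unit
    profit <- function(price) (price - 5) * (772.64 - 51.24 * price)

    optimize(profit, interval = c(5, 15), maximum = TRUE)
    # $maximum   ~ 10.04  (optimal price)
    # $objective ~ 1301   (maximum profit)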
  • 80.  The optimal price is $10.04 and the maximum profit will be $1301, according to the above output. In practice, we can reasonably set the price to $10.00 or $9.99.  We can further use the model to predict the sales when the price is $10.00, with:  ad type = 1  mean Price Apple = 7.659  mean Price Cookies = 9.738 Sales = 774.81 – 51.24 * Price + 29.74 * Ad Type + 22.08 * Price Apple – 25.27 * Price Cookies Sales = 774.81 – 51.24 * (10) + 29.74 * (1) + 22.08 * (7.659) – 25.27 * (9.738)  The sales forecast is 215 units, with a 95% prediction interval of 176 to 254 units, for one store in one week on average.  Based on the forecast and other factors, the supermarket can prepare the inventory for all of its stores after the pilot period. Linear Regression Model:
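In R the forecast and its interval come from predict(); a sketch continuing the simulated m_juice model above (on the real pilot data this is what would reproduce the 215-unit forecast and the 176–254 interval; on the simulated stand-in the numbers will differ):

    newdata <- data.frame(price = 10, ad_type = 1,
                          price_apple = 7.659, price_cookies = 9.738)
    predict(m_juice, newdata, interval = "prediction", level = 0.95)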
  • 81.  Reside in Wayne, Illinois.  Active Semi-Professional Classical Musician (Bassoon).  Married my wife on 10/10/10, and we have been together for 10 years.  Pet Yorkshire Terrier / Toy Poodle named Brunzie.  Pet Maine Coons named Maximus Power and Nemesis Gul du Cat.  Enjoy Cooking, Hiking, Cycling, Kayaking, and Astronomy.  Self-proclaimed Data Nerd and Technology Lover.
  • 83.  http://en.wikipedia.org/wiki/Regression_analysis  http://www.ftpress.com/articles/article.aspx?p=2133374  http://people.duke.edu/~rnau/Notes_on_linear_regression_analysis--Robert_Nau.pdf  http://www.theanalysisfactor.com/interpreting-interactions-in-regression/  http://www.edanzediting.com/blog/statistics_anova_explained#.VIdeEo0tBSM  http://en.wikipedia.org/wiki/Ordinary_least_squares  http://www.unt.edu/rss/class/mike/6810/OLS%20Regression%20Summary.pdf  http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit  http://www.chsbs.cmich.edu/fattah/courses/empirical/multicollinearity.html  http://home.wlu.edu/~gusej/econ398/notes/logRegressions.pdf  http://www.dataapple.net/?p=19