This lecture provides an overview of linear regression analysis, interaction terms, ANOVA, optimization, and log-level and log-log transformations. The first practical example centers on the Boston housing market, while the second example dives into business applications of regression analysis for a supermarket retailer.
2. Introduction to Regression Analysis
Ordinary Least Squares
Assumptions
Detecting Violations
Interaction Terms
Log-Level & Log-Log Transformations
ANOVA
Practical Example
Real Estate
Supermarket Marketing
3. Regression Analysis is the art and science of fitting
straight lines to patterns of data.
Regression analysis is widely used for prediction
and forecasting, where its use has substantial
overlap with the field of machine learning.
In a linear regression model, the variable of interest (the dependent variable) is predicted from one or more other variables (the independent variables) using a linear mathematical formula.
Regression analysis is also used to understand
which among the independent variables are
related to the dependent variable, and to explore
the forms of these relationships.
4. History:
The earliest form of regression was the method of least squares, which was published by the French mathematician Adrien-Marie Legendre in 1805 and by the German mathematician Carl Friedrich Gauss in 1809.
Legendre and Gauss both applied the method to
the problem of determining, from astronomical
observations, the orbits of bodies about the Sun
(mostly comets, but also later the then newly
discovered minor planets).
In the 1950s and 1960s, economists used
electromechanical desk calculators to calculate
regressions. Before 1970, it sometimes took up to
24 hours to receive the result from one
regression.
The first published picture of a regression
line by Francis Galton in 1877
5. Many techniques for carrying out regression
analysis have been developed.
Familiar methods such as linear regression
and ordinary least squares regression are
parametric, in that the regression function is
defined in terms of a finite number of
unknown parameters that are estimated
from the data.
Non-parametric regression refers to
techniques that allow the regression function
to lie in a specified set of functions, which
may be infinite-dimensional.
Our focus will be on ordinary least squares
regression and parametric methods.
6. Regression analysis may be used for a wide variety of business applications, such as:
Measuring the impact of a price increase on a corporation's profits.
Understanding how sensitive a corporation's sales are to changes in advertising
expenditures.
Seeing how a stock price is affected by changes in interest rates.
Calculating price elasticity for goods and services.
Litigation and information discovery.
Total Quality Control Analyses.
Human Resource and talent evaluation.
Regression analysis may also be used for forecasting purposes; for example, a regression
equation may be used to forecast the future demand for a company's products.
7. Simple Linear Regression Formula
The simple regression model can be represented as follows:
Y = β₀ + β₁X1 + ε
The β₀ represents the Y intercept value, the coefficient β₁ represents the slope of the line, X1 is the independent variable, and ε is the error term. The error term is the value needed to correct for the prediction error between the observed and predicted values.
(Y: dependent variable, β₀: intercept, β₁: coefficient, X1: independent variable, ε: error term)
8. Simple Linear Regression Formula
The output of a regression analysis will produce a coefficient table similar to the one
below.
This table shows that the intercept is -114.326 and the Height coefficient is 106.505 +/-
11.55.
This can be interpreted as: for each unit increase in X, we can expect Y to increase by 106.5.
Also, the T value and Pr > |t| indicate that these variables are statistically significant at the
0.05 level and can be included in the model.
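As an illustration, the R sketch below fits a simple linear regression on simulated data chosen to roughly mirror the coefficient table above; the variable names, sample size, and noise level are assumptions rather than the slide's actual data.

```r
# Minimal sketch: simulated data roughly matching the intercept (-114.3)
# and Height coefficient (106.5) quoted above.
set.seed(42)
Height <- runif(50, 1.2, 2.1)
Y      <- -114.3 + 106.5 * Height + rnorm(50, sd = 8)

fit <- lm(Y ~ Height)
summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
# The Height estimate (~106.5) is the expected change in Y per unit increase
# in Height; its p-value indicates significance at the 0.05 level.
```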
9. Multiple Linear Regression Formula
A multiple linear regression is essentially the same as a simple linear regression except
that there can be multiple coefficients and independent variables.
Y = β₀ + β₁X1 + β₂X2 + … + ε
The interpretation of the coefficient is slightly different than in a simple linear regression.
Using the table below, the interpretation can be thought of as: each 1 unit change in Width increases Y by 94.56, while holding all other variables constant.
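A minimal R sketch of the same idea with two simulated predictors (the names Width and Length and the values used here are illustrative assumptions): each coefficient is read as the effect of that variable while holding the other constant.

```r
# Minimal sketch: multiple regression with two simulated predictors.
set.seed(1)
Width  <- runif(60, 2, 6)
Length <- runif(60, 10, 30)
Y      <- 5 + 94.6 * Width + 2 * Length + rnorm(60, sd = 10)

fit2 <- lm(Y ~ Width + Length)
coef(fit2)
# The Width coefficient (~94.6) is the expected change in Y for a one-unit
# change in Width while holding Length constant.
```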
10. What is Ordinary Least Squares or OLS?
In statistics, ordinary least squares
(OLS) or linear least squares is a
method for estimating the unknown
parameters in a linear regression
model.
The goal of OLS is to minimize the sum of squared differences between the observed responses in the dataset and the responses predicted by the linear approximation of the data.
11. Visually this is seen as the sum of the
vertical distances between each data
point in the set and the
corresponding point on the
regression line.
The smaller the differences (square
size), the better the model fits the
data.
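The closed-form least-squares estimates make this minimization concrete. The sketch below, on simulated data, computes the slope and intercept directly and confirms they match R's lm().

```r
# Minimal sketch: the closed-form OLS estimates minimize the sum of
# squared vertical distances and agree with lm().
set.seed(7)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)

b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept
c(intercept = b0, slope = b1)
coef(lm(y ~ x))                 # identical estimates

sum((y - (b0 + b1 * x))^2)      # the minimized residual sum of squares
```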
12. There are a number of classical assumptions which must hold true if we are to trust the
outcome of the regression model.
The sample is representative of the population for the inference prediction.
The error is a random variable with a mean of zero conditional on the
independent variables.
The independent variables are measured with no error.
The predictors are linearly independent, i.e. it is not possible to express any
predictor as a linear combination of the others.
The errors are uncorrelated, that is, the variance–covariance matrix of the errors is
diagonal and each non-zero element is the variance of the error.
The variance of the error is constant across observations (homoscedasticity).
13. Consequences of using an invalid modeling
procedure include:
The consequences have a tremendous impact on the theory that formed the basis of the investigation.
A lack of linear association between
independent and dependent variables, model
misspecification, etc… insinuates that you have
the wrong theory.
Biased, inefficient coefficients due to poor
reliability, collinearity, etc… lead to an incorrect
interpretation regarding your theory.
Outliers imply that you are not able to apply
your theory to the entire population that you
drew your data from.
Overfitting implies that you are overconfident in your theory.
14. There are a number of statistics and diagnostic tests
we can draw from to evaluate linear regression
models beyond EDA.
Coefficient of Determination
Residual Plot
Breusch-Pagan or White Test
Variance Inflation Factor
Influential Observations
Leverage Points
Cook’s Distance
Etc…
15. R2 : Coefficient of Determination
This is a measure of the goodness of fit for a
linear regression model.
It is the percentage of the dependent variable
variation that is explained by a linear model
R2 = Explained variation / Total variation
R2 is always between 0 and 100%:
0% indicates that the model explains none of
the variability of the dependent data around
its mean.
100% indicates that the model explains all the
variability of the dependent data around its
mean.
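A short R sketch of this ratio, computing R² as explained variation over total variation on simulated data and checking it against the value reported by summary().

```r
# Minimal sketch: R-squared = explained variation / total variation.
set.seed(3)
x <- rnorm(80)
y <- 1 + 0.8 * x + rnorm(80)
fit <- lm(y ~ x)

ss_total    <- sum((y - mean(y))^2)
ss_residual <- sum(resid(fit)^2)
(ss_total - ss_residual) / ss_total   # explained / total
summary(fit)$r.squared                # same value
```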
16. Are Low R2 Values Inherently Bad?
No!
There are two major reasons why it can be just fine to have low R-squared values.
In some fields, it is entirely expected that your R-squared values will be low. For
example, any field that attempts to predict human behavior, such as psychology,
typically has R-squared values lower than 50%. Humans are simply harder to predict
than, say, physical processes.
Furthermore, if your R-squared value is low but you have statistically significant
predictors, you can still draw important conclusions about how changes in the
predictor values are associated with changes in the response value.
Regardless of the R-squared, the significant coefficients still represent the mean
change in the response for one unit of change in the predictor while holding other
predictors in the model constant. Obviously, this type of information can be
extremely valuable.
17. Adding independent variables to your model will increase the value of R-squared regardless of whether the variables offer any increase in explanatory power. To combat this issue, we should focus on utilizing the Adjusted R-Squared metric, which penalizes a model for having too many variables.
There is no generally accepted technique for relating the number of total observations to the number of independent variables in a model. One possible rule of thumb suggested by Good and Hardin is:
N = m^n (equivalently, n = ln(N) / ln(m))
Where N is the sample size, n is the number of independent variables, and m is the number of observations needed to reach the desired precision if the model had only one variable.
For example, if the dataset contained 1000 observations and the researcher decided that 5 observations are needed to support a single variable, then the maximum number of independent variables the model can support is 4 (since ln(1000) / ln(5) ≈ 4.3).
18. Key Limitations of R2
R-squared cannot determine whether the
coefficient estimates and predictions are
biased, which is why you must assess the
residual plots.
R-squared does not indicate whether a
regression model is adequate. You can
have a low R-squared value for a good
model, or a high R-squared value for a
model that does not fit the data!
The R-squared in your output is a biased
estimate of the population R-squared.
19. A residual plot is a scatterplot of the
residuals (difference between the actual
and predicted value) against the
predicted value.
A proper model will exhibit a random
pattern for the spread of the residuals
with no discernable shape.
Residual plots are used extensively in
linear regression analysis for diagnostics
and assumption testing.
If the residuals form a curved shape, then we know that a transformation will be necessary, and we can explore methods like the Box-Cox transformation.
[Example plots: random residuals vs. curved residuals]
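As a small illustration, the sketch below draws a residual plot for a model fit to R's built-in cars dataset (chosen here purely for convenience).

```r
# Minimal sketch: residuals vs. fitted values for a simple model.
fit <- lm(dist ~ speed, data = cars)

plot(fitted(fit), resid(fit),
     xlab = "Predicted value", ylab = "Residual", main = "Residual plot")
abline(h = 0, lty = 2)
# plot(fit, which = 1) gives the same diagnostic with a smoother overlaid;
# a curved band would suggest trying a transformation such as Box-Cox.
```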
20. Linear Regression Analysis using OLS contains
an assumption that residuals are identically
distributed across every X variable.
When this condition holds, the error terms are
homoskedastic, which means the errors have
the same scatter regardless of the value of X.
When the scatter of the errors is different,
varying depending on the value of one or more
of the independent variables, the error terms
are heteroskedastic.
A review of a scatterplot of the studentized
residuals against the dependent variable can be
used to detect if heteroskedasticity is present.
The residuals will appear to fan outward in a
distinct pattern.
Heteroskedasticity Pattern
21. Heteroskedasticity has serious consequences for the OLS estimator. Although the OLS
estimator remains unbiased, the estimated standard errors are wrong. Because of this, confidence intervals and hypothesis tests cannot be relied on.
The Breusch-Pagan test (alt. White Test) is a method that can be employed to identify
whether or not the error variances are all equal versus the alternative that the error
variances are a multiplicative function of one or more variables.
The results of this test show that the Chi-Square value is fairly low indicating that
heteroskedasticity is probably not a problem.
Techniques to correct heteroskedasticity:
Re-specify the model. (Include omitted variables)
Transform the variables.
Use Weighted Least Squares in place of OLS.
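A minimal sketch of these steps in R, using the lmtest package's Breusch-Pagan test on the built-in cars data; the 1/speed weighting in the weighted least squares refit is purely an illustrative assumption.

```r
# Minimal sketch: Breusch-Pagan test, then an illustrative WLS refit.
library(lmtest)

fit <- lm(dist ~ speed, data = cars)
bptest(fit)   # a small p-value would indicate heteroskedastic errors

fit_wls <- lm(dist ~ speed, data = cars, weights = 1 / speed)
summary(fit_wls)
```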
22. What is Multicollinearity?
Collinearity (or multicollinearity) is the undesirable
situation where the correlations among the
independent variables are strong.
In some cases, multiple regression results may
seem paradoxical. For instance, the model may fit
the data well (high F-Test), even though none of
the X variables has a statistically significant impact
on explaining Y.
How is this possible? When two X variables are
highly correlated, they both convey essentially the
same information. When this happens, the X
variables are collinear and the results show
multicollinearity.
23. Why is Multicollinearity a Problem?
Multicollinearity misleadingly inflates the
standard errors of the coefficients.
Thus, it can make some variables appear statistically insignificant when they should otherwise be significant.
It is like two or more people singing
loudly at the same time. One cannot
discern which is which. They offset each
other.
24. How to detect Multicollinearity?
Formally, variance inflation factors (VIF) measure
how much the variance of the estimated
coefficients are increased over the case of no
correlation among the X variables. If no two X
variables are correlated, then all the VIFs will be 1.
If VIF for one of the variables is around or greater
than 5, there is collinearity associated with that
variable.
The easy solution is: If there are two or more
variables that will have a VIF around or greater
than 5, one of these variables must be removed
from the regression model. To determine the best
one to remove, remove each one individually.
Select the regression equation that explains the
most variance (R2 the highest).
VIF > 5: collinearity present
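As an illustration, the sketch below computes variance inflation factors with the car package on the Boston housing data from MASS; the particular predictors chosen are an assumption for demonstration only.

```r
# Minimal sketch: VIFs with car::vif(); values around or above 5 flag
# predictors involved in collinearity.
library(car)
library(MASS)

fit <- lm(medv ~ lstat + rm + tax + rad, data = Boston)
vif(fit)   # tax and rad tend to be strongly correlated, so their VIFs are elevated
```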
25. Cook’s distance or Cook’s D is a commonly
used estimate of the influence of a data
point when performing OLS regression.
A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation.
The Bonferroni method is a simple method
that allows many comparison statements
to be made (or confidence intervals to be
constructed) while still assuring an overall
confidence coefficient is maintained.
The hat values measure leverage: they are the diagonal elements of the hat matrix and indicate how far an observation's predictor values lie from those of the other observations.
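The sketch below shows how these diagnostics are typically obtained in R (base functions plus car::outlierTest for the Bonferroni-adjusted outlier test), again on the built-in cars data.

```r
# Minimal sketch: common influence diagnostics on a fitted model.
library(car)

fit <- lm(dist ~ speed, data = cars)
cooks.distance(fit)   # Cook's D for each observation
rstudent(fit)         # studentized residuals
hatvalues(fit)        # leverage: diagonal of the hat matrix
outlierTest(fit)      # Bonferroni-adjusted test of the largest residual
```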
26. Adding interaction terms to a regression model can
greatly expand understanding of the relationships
among the variables in the model and allows more
hypotheses to be tested.
Height = 42 + 2.3 * Bacteria + 11 * Sun
Height = The height of a shrub. (cm)
Bacteria = the amount of bacteria in the soil. (1000
per/ml)
Sun = whether the shrub is located in partial or full
sun. (Sun = 0 partial and Sun = 1 is full)
27. It would be useful to add an interaction term to the
model if we wanted to test the hypothesis that the
relationship between the amount of bacteria in the
soil on the height of the shrub was different in full
sun than in partial sun.
One possibility is that in full sun, plants with more
bacteria in the soil tend to be taller, whereas in
partial sun, plants with more bacteria in the soil are
shorter.
Another possibility is that plants with more bacteria
in the soil tend to be taller in both full and partial
sun, but that the relationship is much more
dramatic in full than in partial sun.
28. The presence of a significant interaction
indicates that the effect of one predictor
variable on the response variable is different at
different values of the other predictor variable.
It is tested by adding a term to the model in
which the two predictor variables are
multiplied.
The regression equation will look like this:
Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
29. Adding an interaction term to a model
drastically changes the interpretation of all of
the coefficients.
If there were no interaction term, B1 would
be interpreted as the unique effect of
Bacteria on Height. But the interaction
means that the effect of Bacteria on Height
is different for different values of Sun.
So the unique effect of Bacteria on Height is
not limited to B1, but also depends on the
values of B3 and Sun.
The unique effect of Bacteria is represented
by everything that is multiplied by Bacteria in
the model: B1 + B3*Sun. B1 is now
interpreted as the unique effect of Bacteria
on Height only when Sun = 0.
Height = B0 + B1 * Bacteria + B2 * Sun + B3 * (Bacteria * Sun)
30. In our example, once we add the interaction
term, our model looks like:
Height = 35 + 4.2*Bacteria + 9*Sun +
3.2*Bacteria*Sun
Adding the interaction term changed the
values of B1 and B2.
The effect of Bacteria on Height is now 4.2 +
3.2*Sun. For plants in partial sun, Sun = 0, so
the effect of Bacteria is 4.2 + 3.2 * 0 = 4.2.
So for two plants in partial sun, a plant with
1000 more bacteria/ml in the soil would be
expected to be 4.2 cm taller than a plant
with less bacteria.
31. For plants in full sun, however, the effect
of Bacteria is 4.2 + 3.2*1 = 7.4.
So for two plants in full sun, a plant with
1000 more bacteria/ml in the soil would
be expected to be 7.4 cm taller than a
plant with less bacteria.
Because of the interaction, the effect of
having more bacteria in the soil is
different if a plant is in full or partial sun.
Another way of saying this is that the
slopes of the regression lines between
height and bacteria count are different for
the different categories of sun. B3
indicates how different those slopes are.
Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
32. Interpreting B2 is more difficult.
B2 is the effect of Sun when Bacteria = 0. Since
Bacteria is a continuous variable, it is unlikely that
it equals 0 often, if ever, so B2 can be virtually
meaningless by itself.
Instead, it is more useful to understand the effect
of Sun, but again, this can be difficult.
The effect of Sun is B2 + B3*Bacteria, which is
different at every one of the infinite values of
Bacteria.
For that reason, often the only way to get an
intuitive understanding of the effect of Sun is to
plug a few values of Bacteria into the equation to
see how Height, the response variable, changes.
Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * (Bacteria*Sun)
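A minimal R sketch of this model on simulated shrub data constructed to match the example coefficients; the data themselves are an assumption.

```r
# Minimal sketch: Bacteria * Sun expands to both main effects plus the
# Bacteria:Sun interaction term (simulated data matching the example).
set.seed(11)
Bacteria <- runif(100, 0, 10)          # thousands of bacteria per ml
Sun      <- rbinom(100, 1, 0.5)        # 0 = partial sun, 1 = full sun
Height   <- 35 + 4.2 * Bacteria + 9 * Sun +
            3.2 * Bacteria * Sun + rnorm(100, sd = 3)

fit_int <- lm(Height ~ Bacteria * Sun)
coef(fit_int)
# Effect of Bacteria = B1 + B3 * Sun: about 4.2 in partial sun, 7.4 in full sun.
```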
33. The presentation so far has only considered the following form of a linear regression equation:
Y = β₀ + β₁X1 + ε
This is also considered a “level-level” specification because the raw values of y are being regressed against the raw values of x.
How do we interpret β1?
We differentiate with respect to X1 to find the marginal effect of x on y: dY/dX1 = β₁. In this case, β₁ itself is the marginal effect.
34. A log-level regression specification:
ln(Y) = β₀ + β₁X1 + ε
This is called a “log-level” specification because the natural-log-transformed values of Y are being regressed on the raw values of x.
You might want to run this specification if you think that increases in x lead to a constant percentage increase in y.
Ex. Wage on Education? Forest Lumber Volume on Years?
35. How do we interpret β1?
First solve for y: Y = exp(β₀ + β₁X1)
Then differentiate to get the marginal effect: dY/dX1 = β₁ · Y
So the marginal effect depends on the value of y, with β₁ itself representing the growth rate.
For example, if we estimate that β1 is 0.04, we would say that one additional year increases the volume of lumber by about 4%.
36. A log-log regression specification:
ln(Y) = β₀ + β₁ ln(X1) + ε
This is called a “log-log” specification because the natural-log-transformed values of Y are being regressed on the log-transformed values of x.
You might want to run this specification if you think that percentage increases in x lead to constant percentage changes in y. Ex. Constant demand elasticity.
To calculate the marginal effect, solve for y: Y = exp(β₀ + β₁ ln(X1))
And differentiate with respect to x: dY/dX1 = β₁ · Y / X1
37. From the previous slide, the marginal effect is: dY/dX1 = β₁ · Y / X1
Solving for β1 we get: β₁ = (dY/Y) / (dX1/X1)
This makes β1 an elasticity. If x1 is a price and y is demand, and we estimate β1 = -0.6, it means that a 1% increase in the price of a good would lead to a 0.6% decrease in demand.
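As an illustration, the sketch below simulates price and demand data with a true elasticity of -0.6 (an assumed value matching the example) and recovers it with a log-log fit.

```r
# Minimal sketch: log-log regression recovering an assumed elasticity of -0.6.
set.seed(5)
price  <- runif(200, 1, 20)
demand <- exp(4 - 0.6 * log(price) + rnorm(200, sd = 0.1))

fit_loglog <- lm(log(demand) ~ log(price))
coef(fit_loglog)["log(price)"]
# ~ -0.6: a 1% increase in price is associated with roughly a 0.6% drop in demand.
```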
38. Analysis of variance, or ANOVA, is used to compare differences of means among more than 2 groups.
It does this by looking at variation in the data
and where that variation is found (hence its
name).
Specifically, ANOVA compares the amount of
variation between groups with the amount of
variation within groups.
It can be used for both observational and
experimental studies.
39. When we take samples from a population, we expect each sample mean to differ
simply because we are taking a sample rather than measuring the whole
population; this is called sampling error but is often referred to more informally as
the effects of “chance”.
Thus, we always expect there to be some differences in means among different
groups.
The question is: is the difference among groups greater than that expected to be
caused by chance? In other words, is there likely to be a true (real) difference in
the population mean?
40. The ANOVA model
Mathematically, ANOVA can be written as:
xij = μi + εij
where x are the individual data points (i and j
denote the group and the individual
observation), ε is the unexplained variation and
the parameters of the model (μ) are the
population means of each group. Thus, each
data point (xij) is its group mean plus error.
Assumptions of ANOVA
The response is normally distributed
Variance is similar within different groups
The data points are independent
41. Hypothesis testing
Like other classical statistical tests, we use ANOVA to calculate a test statistic (the F-ratio)
with which we can obtain the probability (the P-value) of obtaining the data assuming the
null hypothesis.
Null hypothesis: all population means are equal
Alternative hypothesis: at least one population mean is different from the rest.
A significant P-value (usually taken as P<0.05) suggests that at least one group mean is
significantly different from the others. In other words, a variable with p < 0.05 allows us to consider including the variable within a predictive model.
ANOVA separates the variation in the dataset into 2 parts: between-group and within-
group. These variations are called the sums of squares, which can be seen in the following
slides.
42. Calculation of the F ratio
Step 1) Variation between groups
The between-group variation (or between-group sums of squares, SS) is calculated by
comparing the mean of each group with the overall mean of the data.
Specifically, this is:
Between SS = n₁(x̄₁ − x̄)² + n₂(x̄₂ − x̄)² + n₃(x̄₃ − x̄)²
We then divide the Between SS by its degrees of freedom (the number of groups minus 1, because the group deviations from the overall mean must sum to zero, so once all but one are known, the last is also known) to get our estimate of the mean variation between groups.
43. Step 2) Variation within groups
The within-group variation (or the within-group sums of squares) is the variation of each
observation from its group mean.
Within SS (SSr) = s²₁(n₁ − 1) + s²₂(n₂ − 1) + s²₃(n₃ − 1), where s²ᵢ and nᵢ are the variance and size of group i.
i.e., by adding up the variance of each group multiplied by the degrees of freedom of each group. Note, you might also come across the total SS (the sum of squared deviations of every observation from the overall mean). Within SS is then Total SS minus Between SS.
As before, we then divide by the total degrees of freedom to get the mean variation
within groups.
44. Step 3) The F ratio
The F ratio is then calculated as:
F Ratio = (Mean Between-Group SS) / (Mean Within-Group SS)
If the average difference between groups is similar to that within groups, the F ratio is
about 1. As the average difference between groups becomes greater than that within
groups, the F ratio becomes larger than 1.
Therefore, variables with higher F Ratio values provide greater explanatory power when
utilized in predictive models.
To obtain a P-value, the F ratio is tested against the F-distribution with the degrees of freedom associated with the numerator and denominator of the ratio. The P-value is the probability of getting that F ratio or a greater one. Larger F-ratios give smaller P-values.
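A short R sketch of Steps 1 through 3 on simulated groups, computing the between- and within-group sums of squares and the F ratio by hand and checking the result against aov().

```r
# Minimal sketch: between/within sums of squares and the F ratio by hand.
set.seed(9)
values <- c(rnorm(10, mean = 5), rnorm(10, mean = 7), rnorm(10, mean = 6))
group  <- factor(rep(c("A", "B", "C"), each = 10))

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
n_k         <- tapply(values, group, length)

between_ss <- sum(n_k * (group_means - grand_mean)^2)
within_ss  <- sum(tapply(values, group, var) * (n_k - 1))

f_ratio <- (between_ss / (3 - 1)) / (within_ss / (30 - 3))
f_ratio
summary(aov(values ~ group))   # reports the same F value
```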
45. Do you prefer ketchup or soy sauce?
If someone asked you this question, your answer would likely depend upon what you
were eating. You probably wouldn't dunk your spicy tuna roll in ketchup. And most
people (pregnant moms-to-be excluded) don't seem to fancy eating soy sauce with hot
French fries.
46. A Common Error When Using ANOVA to Assess
Variables
So you collect data about your variables of interest,
and now you're ready to do your analysis. This is
where many people make the unfortunate mistake of
looking only at each variables individually.
In addition to considering how each variable impacts
your response variable, you also need to evaluate the
interaction between those variables and determine if
any of those are significant as well.
And much like your preference for ketchup versus soy
sauce depends upon what you’re eating, optimum
settings for a given variable will depend upon the
settings of another variable when an interaction is
present.
47. How to Evaluate and Interpret an Interaction
Let’s use a weight loss example to illustrate how we can evaluate an interaction between
factors. We're evaluating 2 different diets and 2 different exercise programs: one focused
on cardio and one focused on weight training. We want to determine which result in
greater weight loss. We randomly assign participants to either diet A or B and either the
cardio or weight training regimen, and then record the amount of weight they’ve lost
after 1 month.
Here is a snapshot of the data:
48. Example: We want to understand how to explain the WeightLoss variable using the Diet
variable.
Observations:
The F Value is well over 1 indicating that this variable
has some explanatory value for WeightLoss.
The P-Value is statistically significant at the 0.05 level.
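As a sketch, the one-way ANOVA above could be produced in R along these lines; the data frame and column names (weight_df, WeightLoss, Diet) are assumed, not taken from the slides.

# One-way ANOVA: does Diet alone explain WeightLoss?
fit_diet <- aov(WeightLoss ~ Diet, data = weight_df)
summary(fit_diet)   # reports the F value and p-value for the Diet variable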
49. Let’s look at the ANOVA output for both the Diet and Exercise variables.
Observations:
The Diet variable has an F Value of 13.69
The Exercise variable has a F Value of 6.355
Both variables are statistically significant at the 0.05 level
50. We can see that the p-value for the Exercise*Diet interaction is 0.000. Because this p-
value is so small, we can conclude that there is indeed a significant interaction between
Exercise and Diet.
So which diet is better? Our data suggest it’s like asking “ketchup or soy sauce?” The
answer is, "It depends."
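A sketch of how the two-way ANOVA with the interaction term might be fit in R, using the same assumed names as before:

# Two-way ANOVA: main effects for Diet and Exercise plus their interaction
fit_full <- aov(WeightLoss ~ Diet * Exercise, data = weight_df)
summary(fit_full)   # rows for Diet, Exercise, and Diet:Exercise, each with an F test and p-value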
51. Since the Exercise*Diet interaction is significant, let’s use an interaction plot to take a
closer look:
For participants using the cardio program (shown in black), we can see that diet A is best
and results in greater weight loss. However, if you’re following the weight training
regimen (shown in red), then diet B results in greater weight loss than A.
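An interaction plot like the one described can be drawn with base R's interaction.plot(); again, the data frame and column names are assumptions for illustration.

# One line per exercise program; a significant interaction shows up as non-parallel lines
with(weight_df,
     interaction.plot(x.factor     = Diet,
                      trace.factor = Exercise,
                      response     = WeightLoss,
                      col          = c("black", "red"),
                      ylab         = "Mean weight loss"))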
52. Suppose this interaction wasn't on our radar, and we instead focused only on the
individual main effects and their impact on weight loss:
Based on this plot, we would incorrectly conclude that diet A is better than B. As we saw
from the interaction plot, that is only true IF we’re looking at the cardio group.
Clearly, we always need to evaluate interactions when analyzing multiple factors. If you
don't, you run the risk of drawing incorrect conclusions...and you might just get ketchup
with your sushi roll.
53. ANOVA can also be used as a means to compare two linear regression models using the
Chi-square measure.
Here are two regression models we want to compare. The order here is important, so make
sure you list the models in the intended order.
Model 1: y = a
Model 2: y = b
The p-value of the test is 0.82, which means that the fitted model "Model 1" is not significantly
different from Model 2 at the α = 0.05 level. Note that this test makes sense only if Model 1
and Model 2 are nested models (i.e., it tests whether the reduction in the residual sum of
squares is statistically significant).
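In R this comparison can be carried out with anova() on two fitted lm objects; the formulas below are placeholders, since the actual models are not reproduced on the slide.

# Nested model comparison; the reduced model is listed first
model1 <- lm(y ~ x1,      data = df)   # hypothetical reduced model
model2 <- lm(y ~ x1 + x2, data = df)   # hypothetical full model
anova(model1, model2, test = "Chisq")  # is the drop in residual sum of squares significant?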
54. Linear regression is used to analyze
continuous relationships; however,
regression is essentially the same as
ANOVA.
In ANOVA, we calculate means and
deviations of our data from the means.
In linear regression, we calculate the best
line through the data and calculate the
deviations of the data from this line.
The F ratio can be calculated in both.
56. The dynamics and rapid change of the real estate
market require business decision makers to seek
advanced analytical solutions to maintain a
competitive edge.
Real estate pricing and home valuation can
greatly benefit from predictive modeling
techniques, in particular, linear regression.
The dataset we will be working with reviews home
values in Boston, Massachusetts and compiles a
number of other statistics to help aid in
determining property value.
The goal for this exercise will be to provide a
predictive model that can be leveraged to help
real estate businesses in the Boston market.
57. Here is a description of the variables within the dataset:
Our goal is to develop a multiple linear regression model for the median value of a home in
Boston based upon the other variables.
58. First, let's take a look at the raw data in the table.
With so many potential independent variables, we first need to narrow the field to those
that can help explain the median home value.
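Assuming the data are the classic Boston housing set shipped with the MASS package (the slides do not name the source), they can be loaded and inspected like this:

library(MASS)    # contains the Boston housing data

data(Boston)
str(Boston)      # variable names and types
head(Boston)     # first few rows of the raw data; medv is the median home value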
59. A review of the correlation matrix indicates
that there are a number of variables which we
can use when building a model.
Based upon the correlations, we will initially
focus on utilizing the following variables:
indus, rm, tax, ptratio, and lstat.
Afterwards, we will assess the quality of the model's performance and consider an
alternative modeling approach.
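A sketch of this step in R, assuming the MASS Boston data: the correlation matrix guides the variable choice, and the five selected predictors go into an initial linear model.

# Correlations between every variable and the median home value (medv)
round(cor(Boston), 2)

# Initial model with the five predictors chosen from the correlation matrix
fit1 <- lm(medv ~ indus + rm + tax + ptratio + lstat, data = Boston)
summary(fit1)    # coefficient p-values and adjusted R-squared discussed on the next slide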
60. An initial review of the model's output shows a
couple of potential issues with the model.
Despite having a correlation of -0.484 with the
median value, the indus variable is not statistically
significant (p = 0.2802) and should be dropped from
the model. The tax variable is also statistically
insignificant and should be removed from the
model.
The Adjusted R-squared is 0.6772, which indicates
a reasonable goodness of fit and 67.7% of the
variation in house prices can be explained by the
five variables.
There are some who would argue, based on
industry experience, that we could leave the model
as is because it performs reasonably well.
Nevertheless, we will rebuild this model and
improve its performance.
61. Let's double-check that the dependent variable is normally distributed. It appears that the
distribution is right skewed and could benefit from a log transformation.
[Histograms of the dependent variable before and after the log transformation]
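The check and the transformation might look like this in R (again assuming medv from the MASS Boston data is the dependent variable):

# Compare the shape of the response before and after the log transformation
par(mfrow = c(1, 2))
hist(Boston$medv,      main = "medv",      xlab = "Median home value")
hist(log(Boston$medv), main = "log(medv)", xlab = "Log median home value")
par(mfrow = c(1, 1))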
62. Let's use an automated feature selection
procedure called stepwise selection to identify
those variables which are both statistically
significant and add value to the regression model.
This revised model now has all variables
showing statistical significance at the
p<0.05 level.
Additionally, the model now has an Adjusted
R-squared of 73.4% compared to 67.7%, which is a
notable improvement.
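One common way to run stepwise selection in R is stepAIC() from MASS; this is a sketch of the idea, not necessarily the exact procedure used to produce the slide's output.

# Stepwise selection (both directions) on the log-transformed response
full_fit <- lm(log(medv) ~ ., data = Boston)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)
summary(step_fit)   # retained variables and the improved adjusted R-squared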
63. We checked the VIF to determine whether multicollinearity is an issue. All of the values
are below 3, which indicates that it is not a concern.
A review of the QQ plot indicates that the data generally agree with a normal distribution;
however, there are longer tails at both ends of the distribution.
A review of the residual plots indicates the potential need to apply some transformation of
the independent variables to further improve the model.
[Figures: QQ plot and residual plot]
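These checks can be reproduced roughly as follows, using vif() from the car package and the standard lm diagnostic plots (a sketch based on the stepwise model fit above):

library(car)                # provides vif()

vif(step_fit)               # values below about 3 suggest multicollinearity is not a problem

par(mfrow = c(1, 2))
plot(step_fit, which = 2)   # normal Q-Q plot of the residuals
plot(step_fit, which = 1)   # residuals vs fitted values
par(mfrow = c(1, 1))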
64. Because we performed a log-level transformation of the data, we now interpret a
one-unit change in x as an approximately constant percentage change in y.
Interpretation:
Therefore, each additional room a house has increases the price by 12%, holding other
variables constant.
If the home is near the river, the price increases by 13%, holding other variables constant.
For each additional unit of distance from the main employment centers, the price decreases
by 3%, holding other variables constant.
Here is the final model that we produced:
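Because the model is log-level, each coefficient translates into an approximate percentage change in price; a small sketch of that conversion, assuming the variable names of the Boston dataset match the fitted model above:

# Convert log-level coefficients (excluding the intercept) into approximate % effects on price
pct_effects <- (exp(coef(step_fit)[-1]) - 1) * 100
round(pct_effects, 1)   # e.g. the rm entry is the % change in price per additional room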
66. A supermarket is selling a new type of grape juice
in some of its stores for pilot testing.
The senior management wants to understand the
relationship between the grape juice and its
impact on apple juice, cookie sales, and
profitability.
We will showcase how it is possible to build off of
linear OLS regression models and econometric
methodologies to solve a series of advanced
business problems.
The goal will be to provide tangible
recommendations from our analyses to help this
business manage their portfolio.
67. Our goal is to set up an experiment to analyze:
Which type of in-store advertisement is more effective? The marketing group has placed two
types of ads in stores for testing: one theme is the natural production of the juice, the other
theme is family health caring.
The Price Elasticity – the reaction of the grape juice sales quantity to changes in its own price.
The Cross-Price Elasticity – the reaction of the grape juice sales quantity to price changes of
other products, such as apple juice and cookies, in the same store.
How do we find the unit price of the grape juice that maximizes profit, and what is the sales
forecast at that price?
68. Here is a description of the variables within the dataset:
First, let's take a look at the raw data in the table.
69. From the summary table, we can get a rough picture of the basic statistics of each numeric variable.
For example, the mean value of sales is 216.7 units, the min value is 131, and the max value is
335.
We can further explore the distribution of sales by visualizing the data graphically, as follows.
We don't find outliers in the box plot, and the sales distribution is roughly normal, so it is not
necessary to apply further cleaning or treatment to the data set.
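A sketch of these checks in R, assuming the pilot data sit in a data frame called juice with a sales column (names assumed, not taken from the slides):

# Basic statistics and distribution checks for weekly sales
summary(juice$sales)                                  # mean, min, max, quartiles
boxplot(juice$sales, ylab = "Weekly sales (units)")   # visual outlier check
hist(juice$sales, main = "Sales distribution", xlab = "Units sold per week")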
70. The marketing team wants to find out which of the two ad types is more effective for sales:
a natural production theme
a family health caring theme
As a first step, we can calculate and compare the mean of sales under the two different
ad types.
The mean of sales with the natural production theme is about 187; the mean of sales with
the family health caring theme is about 247.
It looks like the latter is better.
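Using the same assumed data frame, the group means can be compared directly; here ad_type = 0 is taken to be the natural production theme and ad_type = 1 the family health caring theme, following the coding mentioned later in the analysis.

# Mean weekly sales under each ad theme
tapply(juice$sales, juice$ad_type, mean)   # roughly 187 vs 247 in the pilot data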
71. To find out how likely it is that this conclusion holds for the whole population, it is
necessary to do a statistical test – a two-sample t-test.
Both groups of sales data appear normally distributed; to be more certain, we can run
the Shapiro-Wilk test.
The p-values of the Shapiro-Wilk tests
are larger than 0.05, so there is no strong
evidence to reject the null hypothesis
that the two groups of sales data are
normally distributed.
72. Now we can conduct the Welch two sample t-test since the t-test assumptions are met.
From the output of t-test above, we can say that:
We have strong evidence to say that the population means of the sales with the two
different ad types are different because the p-value of the t-test is very small;
With 95% confidence, we can estimate that the mean of the sales with the natural
production theme ad is somewhere between 27 and 93 units less than that of the sales
with the family health caring theme ad.
So the conclusion is that the ad with the theme of family health caring is BETTER.
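A sketch of both tests in R, under the same assumed data frame and ad_type coding:

# Normality check within each ad group (Shapiro-Wilk)
shapiro.test(juice$sales[juice$ad_type == 0])   # natural production theme
shapiro.test(juice$sales[juice$ad_type == 1])   # family health caring theme

# Welch two-sample t-test (unequal variances is R's default)
t.test(sales ~ ad_type, data = juice)           # p-value and 95% CI for the difference in means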
73. With the information given in the data set,
we can explore how grape juice price, ad type,
apple juice price, and cookie price influence the
sales of grape juice in a store by multiple linear
regression analysis.
Here, “sales” is the dependent variable and
the others are independent variables.
Let’s investigate the correlation between
the sales and other variables by displaying
the correlation coefficients in pairs.
The correlation coefficients between sales
and price, ad type, price apple, and price
cookies are 0.85, 0.58, 0.37, and 0.37
respectively, which means they all might
have some influence on the sales.
74. We can try to add all of the independent
variables into the regression model:
The p-values for Price, Ad Type, and Price Cookies
are much less than 0.05; they are significant in
explaining the sales, and we are confident about
including these variables in the model.
The p-value for Price Apple is a bit larger than
0.05, so there seems to be no strong evidence
that the apple juice price explains the sales.
However, from real-life experience we know that
when the apple juice price is lower, consumers are
likely to buy more apple juice, and the sales of
other fruit juices will then decrease.
So we can also add it to the model to explain
the grape juice sales.
The Adjusted R-squared is 0.881, which indicates
a reasonable goodness of fit and 88% of the
variation in sales can be explained by the four
variables.
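A sketch of the full model in R; the column names price, ad_type, price_apple, and price_cookies are assumed for illustration.

# Multiple linear regression for grape juice sales
juice_fit <- lm(sales ~ price + ad_type + price_apple + price_cookies, data = juice)
summary(juice_fit)   # coefficient signs, p-values, and the adjusted R-squared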
75. The assumptions for the regression to be valid
are that the data are random and independent,
and that the residuals are normally distributed
with constant variance.
Let's check the residual assumptions visually.
The Residuals vs Fitted graph above
shows that the residuals scatter around
the fitted line with no obvious pattern, and
the Normal Q-Q graph shows that
basically the residuals are normally
distributed. The assumptions are met.
The VIF test value for each variable is
close to 1, which means the
multicollinearity is very low among these
variables.
76. With the model established, we can analyze the Price Elasticity (PE) and Cross-Price
Elasticity (CPE) to predict how the sales quantity reacts to price changes.
Price Elasticity
PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3
P is price, Q is sales quantity
ΔQ/ΔP = -51.24 , the parameter before the variable “price” in the above model
P/Q = 9.738 / 216.7 = 0.045
P is the mean of prices in the dataset, Q is the mean of the Sales variable.
Interpretation: The PE indicates that a 10% decrease in grape juice price will increase the
grape juice sales by 23%, and vice versa.
[Figure: linear regression model output]
77. Let's further calculate the CPE on apple juice and cookies to analyze how changes in the
apple juice price and cookie price influence the sales of grape juice.
Cross Price Elasticity
CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * ( 7.659 / 216.7) = 0.78
CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * ( 9.622 / 216.7) = – 1.12
Interpretation:
The CPEapple indicates that a 10% decrease in apple juice price will DECREASE the sales of
grape juice by 7.8%, and vice versa. So grape juice and apple juice are substitutes.
The CPEcookies indicates that a 10% decrease in cookie price will INCREASE the grape juice
sales by 11.2%, and vice versa. So grape juice and cookies are complements; placing the two
products together will likely increase the sales of both.
We can also see that the grape juice sales increase by 29.74 units when using the ad with
the family health caring theme (ad_type = 1).
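The elasticity arithmetic above can be reproduced from the fitted model and the sample means; a sketch under the same assumed names:

# Elasticities evaluated at the sample means
b <- coef(juice_fit)
Q <- mean(juice$sales)

pe          <- b["price"]         * mean(juice$price)         / Q   # own-price elasticity
cpe_apple   <- b["price_apple"]   * mean(juice$price_apple)   / Q   # cross-price: apple juice
cpe_cookies <- b["price_cookies"] * mean(juice$price_cookies) / Q   # cross-price: cookies

round(c(PE = unname(pe), CPE_apple = unname(cpe_apple), CPE_cookies = unname(cpe_cookies)), 2)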
78. Usually companies want to get higher profit
rather than just higher sales quantity.
So, how do we set the optimal price for the new
grape juice to achieve the maximum profit, based on
the dataset collected in the pilot period and our
regression model?
To simplify the question, we can let the Ad Type
= 1, the Price Apple = 7.659 (mean value), and
the Price Cookies = 9.738 (mean value).
The model is simplified as follows:
Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 *
7.659 – 25.28 * 9.738
Sales = 772.64 – 51.24*price
79. Assume the marginal cost (C) per unit of grape juice is $5.00. We can calculate the profit
(Y) by the following formula:
Y = (price – C) * Sales Quantity = (price – 5) * (772.64 – 51.24*price)
Y = –51.24 * price² + 1028.84 * price – 3863.2
To get the optimal price to maximize Y, we can use the following R function.
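The R function used on the slide is not reproduced here; one straightforward way to carry out this maximization is base R's optimize(), sketched below with the simplified sales equation:

# Profit as a function of price, using the simplified model and a $5 marginal cost
profit <- function(price) (price - 5) * (772.64 - 51.24 * price)

# Search a plausible price range for the maximum
optimize(profit, interval = c(5, 20), maximum = TRUE)
# $maximum is about 10.04 and $objective about 1301, matching the output discussed next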
80. The optimal price is $10.04; the maximum profit will be $1301 according to the above
output. In reality, we can reasonably set the price to be $10.00 or $9.99.
We can further use the model to predict the sales when the price is set at $10.00.
Additionally, the ad type = 1
Mean Price Apple = 7.659
Mean Price Cookies = 9.738
Sales = 774.81 – 51.24 * Price + 29.74 * Ad Type + 22.08 * Price Apple – 25.27 * Price Cookies
Sales = 774.81 – 51.24 * (10) + 29.74 * (1)+ 22.08 * (7.659) – 25.27 *(9.738)
The sales forecast is 215 units on average per store per week, with a 95% prediction range of
176 to 254 units.
Based on the forecast and other factors, the supermarket can prepare the inventory for all
of its stores after the pilot period.
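With the fitted model object, the same forecast and its 95% prediction interval can be obtained with predict(); a sketch using the assumed variable names from earlier:

# Forecast weekly sales at the chosen price, family-health ad, and mean competitor prices
new_store <- data.frame(price = 10, ad_type = 1,
                        price_apple = 7.659, price_cookies = 9.738)
predict(juice_fit, newdata = new_store, interval = "prediction", level = 0.95)
# fit is about 215 units, with a 95% interval of roughly 176 to 254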
[Figure: linear regression model output]
81. Reside in Wayne, Illinois
Active Semi-Professional Classical Musician
(Bassoon).
Married my wife on 10/10/10 and been
together for 10 years.
Pet Yorkshire Terrier / Toy Poodle named
Brunzie.
Pet Maine Coons named Maximus Power and
Nemesis Gul du Cat.
Enjoy Cooking, Hiking, Cycling, Kayaking, and
Astronomy.
Self-proclaimed Data Nerd and Technology
Lover.