Linear Regression Preliminaries

  • Notation
  • Residual sum of squares (RSS)

Residual Sum of Squares

\(e_i\) will represent the residual, \[ e_i =\text{ data } - \text{ prediction } = y_i - \beta_0 - \beta_1 x_i \]

  • Goal: Minimize the error between prediction and data!
  • The residual sum of squares is defined as \[ \texttt{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

Residual Sum of Squares

The ordinary least squares line solves the following problem: \[ \begin{align*} \underset{\beta_0, \beta_1 \in \mathbb{R}}{\min} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i\right)^2 \end{align*} \]

Q: I have 100 feet of fence material and I want to maximize the area the fence encloses.

Note: the property has a river on one side. What is the length and width of the fence?
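Since the river covers one side, a rectangle of width \(w\) and length \(\ell\) uses \(\ell + 2w = 100\) feet of fence, which reduces the area to a function of a single variable: \[ A(w) = w(100 - 2w) = 100w - 2w^2, \qquad A'(w) = 100 - 4w = 0 \implies w = 25, \ \ell = 50. \] Since \(A''(w) = -4 < 0\), this critical point is a maximum: the fence should be 25 feet wide and 50 feet long.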

RSS (continued)

We can apply the same technique as the fence problem and differentiate with respect to one variable at a time: \[ \begin{align*} \frac{\partial \texttt{RSS}(\beta_0, \beta_1)}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0\\ \frac{\partial \texttt{RSS}(\beta_0, \beta_1)}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0. \end{align*} \]

RSS (continued)

Rearranging terms, we obtain \[ \begin{align*} \beta_0 n + \beta_1 \sum x_i &= \sum y_i\\ \beta_0 \sum x_i + \beta_1 \sum x_i^2 &= \sum x_i y_i, \end{align*} \] which are called the normal equations for the simple linear regression model.

Practice

Class Activity: For the following data, plot the data and calculate \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

(1,6)

(2,5)

(3,7)

(4,10)
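The normal equations can be solved directly for this data set as a \(2 \times 2\) linear system; a minimal R sketch (the variable names `A` and `b` are my own):

```r
# data from the class activity
x <- c(1, 2, 3, 4)
y <- c(6, 5, 7, 10)
n <- length(x)

# normal equations as a 2x2 linear system:
#   beta0 * n      + beta1 * sum(x)   = sum(y)
#   beta0 * sum(x) + beta1 * sum(x^2) = sum(x * y)
A <- matrix(c(n, sum(x), sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
b <- c(sum(y), sum(x * y))

beta <- solve(A, b)  # (beta0_hat, beta1_hat)
beta
## [1] 3.5 1.4
```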

Implement in R

# clear variables
rm(list=ls())

# create fake data
x <- c(1,2,3,4)
y <- c(6,5,7,10)

# fit the linear model y ~ x and store it as lm.xy
lm.xy <- lm(y ~ x)

# call a summary for the linear model
summary(lm.xy)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    1    2    3    4 
##  1.1 -1.3 -0.7  0.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.5000     1.7748   1.972    0.187
## x             1.4000     0.6481   2.160    0.163
## 
## Residual standard error: 1.449 on 2 degrees of freedom
## Multiple R-squared:    0.7,  Adjusted R-squared:   0.55 
## F-statistic: 4.667 on 1 and 2 DF,  p-value: 0.1633

# verify the residual standard error
myrss <- sum(lm.xy$residuals^2)  # residual sum of squares (RSS)
myerr <- sqrt(myrss / 2)         # residual standard error = sqrt(RSS / df), df = n - 2

myrss
## [1] 4.2
myerr
## [1] 1.449138

RSS (cont.)

Solving the normal equations for \(\beta_0\) and \(\beta_1\) in general, we have \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \quad \hat{\beta}_1 = \frac{\texttt{SXY}}{\texttt{SXX}}, \] where \(\texttt{SXY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\) and \(\texttt{SXX} = \sum_{i=1}^{n} (x_i - \bar{x})^2\).
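These closed-form estimates can be checked against `lm()` on the class-activity data; a quick sketch:

```r
# data from the class activity
x <- c(1, 2, 3, 4)
y <- c(6, 5, 7, 10)

# SXX and SXY as defined above
sxx <- sum((x - mean(x))^2)                 # 5
sxy <- sum((x - mean(x)) * (y - mean(y)))   # 7

beta1_hat <- sxy / sxx                      # 7 / 5 = 1.4
beta0_hat <- mean(y) - beta1_hat * mean(x)  # 7 - 1.4 * 2.5 = 3.5

# agrees with the coefficients reported by the built-in fit
coef(lm(y ~ x))
```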