MATH 106 - Applied Linear Statistical Models

Last Time in Math 106 \(\ldots\)

Scatterplots
Residuals, Mean, and Variance Functions
Introduction to R

Variance Functions

Another part of the distribution of \(Y\) is described by the variance function,

\[ \text{Var}(Y \;|\; X = x) \]

A frequent assumption in fitting linear regression models is that the variance function is the same for every value of \(x\):

\[ \text{Var}(Y \;|\; X = x) = \sigma^2 \]

This assumption is usually made for convenience, but we will discuss more general variance models in Ch. 7.
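To make the constant-variance assumption concrete, here is a minimal simulation sketch in R. The data, the mean function \(1 + 0.5x\), and the value \(\sigma = 2\) are all made up for illustration; none of them come from the course data.

```r
# Simulated data (assumed for illustration): Var(Y | X = x) = sigma^2 for every x
set.seed(106)
n <- 5000
sigma <- 2
x <- sample(1:5, n, replace = TRUE)        # five distinct predictor values
y <- 1 + 0.5 * x + rnorm(n, sd = sigma)    # same error sd at every x

# Sample variance of y within each value of x
var_by_x <- tapply(y, x, var)
print(round(var_by_x, 2))
```

Each entry of var_by_x should be close to \(\sigma^2 = 4\), whichever value of \(x\) it belongs to.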

Summary Graph

A summary graph is a scatterplot of \(Y\) versus \(X\).

Q: Why should you take time to explore these graphs?

First step in exploring the relationships of variables.
Even if a straight line is not appropriate, you can still estimate the best-fit line \(\beta_0 + \beta_1 x\).
Anscombe (1973) simulated data that illustrate this; they are available in R as anscombe.
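The point about best-fit lines can be checked directly with a short sketch. It uses the anscombe data frame from R's built-in datasets package (columns x1 to x4 and y1 to y4). The object names fits and coefs are chosen here for illustration.

```r
# Fit y_i ~ x_i by OLS for each of the four Anscombe data sets
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect the estimated intercepts and slopes, one row per data set
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 2))
```

The four fitted lines are nearly identical, so only the plots (for example, plot(y1 ~ x1, data = anscombe)) reveal how different the four data sets are.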

Anscombe Data

Q: How do we determine outliers?

Smallmouth Bass

Smallmouth Bass (cont.)

What is different about this particular data?
Can you predict the length of a smallmouth bass at age 4?

This data is from a cross-sectional study, as opposed to a longitudinal study, where one would keep track of the age and length of the same fish over time.

Uncorrelated Variables (Ex: Weather)

For uncorrelated variables (variables that show neither a positive nor a negative association), we will need to conduct appropriate statistical tests to check for a difference.

Plot of snowfall from 1900-1992 (in.). The dashed line is the OLS line.

Tools for looking at scatterplots

Size - changing or resizing scales
Transformations - transforming either \(X\) or \(Y\) so the summary graph is more appropriate. This will usually be a log transform or a power transform \(X^\lambda\).
Smoothers for the Mean Function - for this class, only for visual purposes
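As a rough sketch of the last two tools, the code below log-transforms a predictor and computes a smoother for the mean function. The data are simulated for illustration only, and lowess() is just one of several smoothers available in base R; the slides do not say which one the course uses.

```r
# Simulated data (assumed for illustration): y is linear in log(x), not in x
set.seed(106)
x <- exp(runif(200, min = 0, max = 5))
y <- 2 + 3 * log(x) + rnorm(200, sd = 0.5)

# Linear association before and after the log transform of the predictor
cor_raw <- cor(x, y)
cor_log <- cor(log(x), y)
print(c(raw = cor_raw, log = cor_log))

# lowess() returns the smoothed (x, y) pairs; f is the smoother span
smooth_fit <- lowess(log(x), y, f = 2/3)
# To draw it: plot(log(x), y); lines(smooth_fit, lty = 2)
```

After the transform the points fall close to a straight line, and the smoother gives a visual estimate of the mean function without assuming any particular form.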

Scatterplot Matrices

With one predictor, a scatterplot provides a summary of the regression relationship between \(X\) and \(Y\).
With many predictors, we need to look at many scatterplots.
A scatterplot matrix is a convenient way to organize these plots.

Fuel Consumption

Let’s look at the fuel2001 data in R,

library(alr4)  # assumption: fuel2001 is loaded from the alr4 package
myfuel <- fuel2001
# generate a summary of all columns in data
summary(myfuel)

##     Drivers             FuelC              Income          Miles       
##  Min.   :  328094   Min.   :  148769   Min.   :20993   Min.   :  1534  
##  1st Qu.: 1087128   1st Qu.:  737361   1st Qu.:25323   1st Qu.: 36586  
##  Median : 2718209   Median : 2048664   Median :27871   Median : 78914  
##  Mean   : 3750504   Mean   : 2542786   Mean   :28404   Mean   : 77419  
##  3rd Qu.: 4424256   3rd Qu.: 3039932   3rd Qu.:31208   3rd Qu.:112828  
##  Max.   :21623793   Max.   :14691753   Max.   :40640   Max.   :300767  
##       MPC             Pop                Tax       
##  Min.   : 6556   Min.   :  381882   Min.   : 7.50  
##  1st Qu.: 9391   1st Qu.: 1162624   1st Qu.:18.00  
##  Median :10458   Median : 3115130   Median :20.00  
##  Mean   :10448   Mean   : 4257046   Mean   :20.15  
##  3rd Qu.:11311   3rd Qu.: 4845200   3rd Qu.:23.25  
##  Max.   :17495   Max.   :25599275   Max.   :29.00

Q: Why should Fuel and Dlic be included in this data?

Fuel Consumption (cont.)

# create a scatterplot matrix of fuel data
plot(myfuel)

Linear Regression Preliminaries

Notation
Residual sum of squares (RSS)

Notation

The symbol \(\text{E}(u_i)\) is read as the expected value of the random variable \(u_i\).
The phrase “expected value” is the same as the phrase “mean value.”
Informally, the expected value of \(u_i\) is the average value of a very large sample drawn from the distribution of \(u_i\).
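The informal description can be tried out with a quick simulation. The binomial distribution below is an arbitrary choice made for this sketch; its expected value is known to be size times prob.

```r
set.seed(106)
# Assumed example: number of successes in 10 trials, success probability 0.3
u <- rbinom(100000, size = 10, prob = 0.3)

expected_value <- 10 * 0.3   # E(u) for a binomial random variable
sample_average <- mean(u)    # average of a very large sample
print(c(expected = expected_value, average = sample_average))
```

The two numbers agree closely, and the agreement improves as the sample size grows.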

Q: What is the expected value of rolling a die?

Expected Value

The expected value is a linear operator, which means
\[ \begin{align*} \text{E}(a_0 +a_1u_1) &= a_0 +a_1 \text{E}(u_1)\\ \text{E}\left( a_0 +\sum a_i u_i\right) &= a_0 +\sum a_i \text{E}(u_i), \end{align*} \] where \(a_0\) and the \(a_i\) are constants and the \(u_i\) are random variables.
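The sample mean obeys the same linearity rule, so the first identity can be checked numerically on any sample. This is an illustration rather than a proof; the exponential distribution and the constants a0 and a1 are arbitrary choices for the sketch.

```r
set.seed(106)
u <- rexp(1000, rate = 0.5)   # any sample works; this choice is arbitrary
a0 <- 4
a1 <- -2

lhs <- mean(a0 + a1 * u)      # average of the transformed values
rhs <- a0 + a1 * mean(u)      # transform of the average
print(all.equal(lhs, rhs))
```

The two sides match up to floating-point rounding, which is why all.equal() is used instead of ==.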

Class Activity: Using this information, show that the expected value of the sample mean is equal to the population mean.

Variance

The variance is defined by the equation \[ \text{Var}(u_i) = \text{E}\left[\left(u_i - \text{E}(u_i)\right)^2\right], \] the expected squared difference between an observed value of \(u_i\) and its mean value. For uncorrelated random variables, \[ \text{Var}\left(a_0 + \sum a_i u_i \right)= \sum a_i^2 \text{Var} (u_i). \]

Note: The variance of a constant is zero, which is why \(a_0\) does not appear on the right-hand side.
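A simulation sketch of the variance rule for uncorrelated random variables follows. The two normal variables, their standard deviations, and the constants are all assumptions made for illustration.

```r
set.seed(106)
n <- 100000
u1 <- rnorm(n, sd = 1)   # Var(u1) = 1
u2 <- rnorm(n, sd = 2)   # Var(u2) = 4, generated independently of u1
a0 <- 5
a1 <- 2
a2 <- -3

theory <- a1^2 * 1^2 + a2^2 * 2^2            # sum of a_i^2 * Var(u_i)
simulated <- var(a0 + a1 * u1 + a2 * u2)     # sample variance of the combination
print(c(theory = theory, simulated = simulated))

# A constant has no spread, so its sample variance is zero
var_constant <- var(rep(a0, 10))
```

The simulated variance lands near the theoretical value, and the constant a0 contributes nothing to either side.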