STATS 250

Prediction Intervals and Confidence Intervals for Linear Regression

Importantly, you can't test one-sided alternative hypotheses with the ANOVA F-test for regression. With the t-test you can test not-equal, greater-than, or less-than alternatives, but the F-test only handles the two-sided (not equal) case.

Now, we're going to look at confidence intervals for predictions. There are two scenarios:

  • How do you construct an interval for the prediction of a single data point?
  • How do you construct an interval for the prediction of a mean response?

First, consider predicting a single point. The point estimate is $$\hat{y} = b_1x + b_0$$, and you also have an estimated standard deviation $$s$$ for the regression:

$$s = \sqrt{\text{MSE}}$$

When predicting a mean response, you use the expected value:

$$E[\hat{y}] = b_1x + b_0$$

Although this looks really similar, the standard errors are different. When looking at a mean, you have less error. When looking at an individual, you have more error.
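As a minimal sketch of where $$b_0$$, $$b_1$$, and $$s = \sqrt{\text{MSE}}$$ come from, here is the least-squares fit computed by hand in numpy. The `x` and `y` values are made-up example data, not from the notes:

```python
import numpy as np

# Hypothetical example data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

y_hat = b0 + b1 * x             # fitted values
sse = np.sum((y - y_hat) ** 2)  # sum of squared residuals
mse = sse / (n - 2)             # simple regression uses n - 2 degrees of freedom
s = np.sqrt(mse)                # s = sqrt(MSE), the estimated error SD
```

The $$n - 2$$ divisor reflects the two estimated parameters, $$b_0$$ and $$b_1$$.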

Predicting for Individuals vs Mean

For individuals, the standard deviation is just:

$$\sigma$$

For a mean of $$n$$ observations, the standard deviation is:

$$\frac{\sigma}{\sqrt{n}}$$

You can construct a prediction interval for an individual response, and this interval will be wider than a confidence interval for a mean response. Notice the change in nomenclature.

Confidence Interval for a Mean Response

$$\hat{y} \pm t^* \text{s.e.}(\text{fit})$$

$$\text{s.e.}(\text{fit}) = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
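The formula above can be computed directly. This sketch reuses the same made-up data as before (an assumption, not from the notes) and builds a 95% confidence interval for the mean response at a hypothetical point $$x_0 = 3.5$$:

```python
import numpy as np
from scipy import stats

# Hypothetical example data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.5                     # point at which we estimate the mean response
y_hat = b0 + b1 * x0
# s.e.(fit) = s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
se_fit = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)

t_star = stats.t.ppf(0.975, df=n - 2)   # 95% interval, n - 2 df
ci = (y_hat - t_star * se_fit, y_hat + t_star * se_fit)
```

Note that se_fit grows as $$x_0$$ moves away from $$\bar{x}$$: the interval is narrowest at the center of the data.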

Prediction Interval for an Individual Response

$$\hat{y} \pm t^* \text{s.e.}(\text{pred})$$

$$\text{s.e.}(\text{pred}) = \sqrt{s^2 + (\text{s.e.}(\text{fit}))^2}$$

Notice that this standard error combines $$s$$ and the fit standard error like a Euclidean norm: the square root of the sum of their squares.
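Continuing the same made-up example (the data and $$x_0 = 3.5$$ are assumptions for illustration), the prediction interval adds $$s^2$$ under the square root, so it is always wider than the confidence interval:

```python
import numpy as np
from scipy import stats

# Hypothetical example data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.5
y_hat = b0 + b1 * x0
se_fit = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
# s.e.(pred) = sqrt(s^2 + s.e.(fit)^2) -- like a Euclidean norm of the two
se_pred = np.sqrt(s ** 2 + se_fit ** 2)

t_star = stats.t.ppf(0.975, df=n - 2)
pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
```

Because $$s^2 > 0$$, se_pred > se_fit at every $$x_0$$, which is exactly why the prediction interval for an individual is wider than the confidence interval for a mean.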

Epsilon Error

Note that there is randomness in the linear relationship:

$$y = \beta_0 + \beta_1x + \epsilon$$

Where $$\epsilon$$ is normally distributed with mean 0 and constant standard deviation $$\sigma$$.

Assumptions

  • Relationship is in fact linear.
  • Errors should be normally distributed.
  • Errors should have constant variance.
  • Errors should not display obvious patterns.

We cannot observe the actual $$\epsilon$$.

Testing Linearity

Examine a scatterplot of $$y$$ versus $$x$$.

Testing that errors are normally distributed

Use a QQ plot! (Or a histogram)
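A QQ plot compares the sorted residuals against the quantiles of a normal distribution; if normality holds, the points fall near a straight line. This sketch uses `scipy.stats.probplot` on simulated residuals (the data are assumed, and in practice you would pass your actual regression residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated residuals; if normality holds, QQ plot points lie on a line
residuals = rng.normal(0, 1, size=100)

# probplot returns (theoretical quantiles, ordered residuals) plus a
# least-squares line (slope, intercept) and the correlation r of the fit
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
# r close to 1 indicates the residuals look approximately normal
```

Passing `plot=plt` (with matplotlib) draws the QQ plot directly; here we just inspect `r` numerically.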

Testing constant variance of errors

A residual plot shows the residuals (centered at 0) plotted against the x values. You should check for random scatter with no pattern.

A funnel shape is evidence of non-constant variance, or heteroscedasticity. Who the hell came up with that word?
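The funnel shape is usually judged by eye, but a crude numeric version of the same idea is to check whether the absolute residuals trend upward with x. This is an informal sketch (not a formal test such as Breusch-Pagan), using simulated residuals as assumed example data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)

# Simulated heteroscedastic errors: spread grows with x (a funnel shape)
funnel = rng.normal(0, 0.2 * x)
# Simulated homoscedastic errors: constant spread
flat = rng.normal(0, 1.0, size=x.size)

def spread_trend(x, resid):
    """Crude funnel check: correlation between x and |residuals|.
    Near 0 suggests constant variance; clearly positive suggests the
    spread increases with x (heteroscedasticity)."""
    return np.corrcoef(x, np.abs(resid))[0, 1]
```

On the funnel-shaped residuals `spread_trend` comes out clearly positive, while on the constant-spread residuals it hovers near 0, mirroring what the residual plot shows visually.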