STATS 250

Side-by-Side Boxplots, Standard Deviation, and Empirical Rule

Box Plots

A box plot is a graphical display of the five-number summary: the minimum, \(Q_1\), the median, \(Q_3\), and the maximum. It is drawn as follows:

Box Plot

  • The middle line in the box is the median.
  • The edges of the box are the values \(Q_1\) and \(Q_3\).
  • The whiskers (edge lines) extend from the box to the minimum and maximum values.
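
The five values that a box plot draws can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the function name `five_number_summary` and the data in `scores` are hypothetical, and the `"inclusive"` quartile method is only one of several conventions, so a textbook or calculator may give slightly different values for \(Q_1\) and \(Q_3\).

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points.
    # method="inclusive" is an assumption; other conventions exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical data, chosen only for illustration.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(scores)
print(summary)
```

The box spans the second and fourth values, the middle line sits at the third, and the whiskers reach the first and last.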

Modified Boxplots

A modified boxplot has the same box (\(Q_1\), median, and \(Q_3\)), but the whiskers are drawn differently:

  • The whiskers extend only to the most extreme observations that are not outliers. Which points count as outliers is somewhat subjective.
  • The outliers are shown as dots.
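
These notes do not fix a rule for deciding which points are outliers. One widely used convention, assumed here purely for illustration, flags any point more than 1.5 times the interquartile range (\(Q_3 - Q_1\)) beyond the box. The function name `modified_boxplot_parts`, the parameter `k`, and the data are hypothetical.

```python
import statistics

def modified_boxplot_parts(data, k=1.5):
    """Return (lower whisker, upper whisker, outliers).

    Assumes the k * IQR convention for outliers (k=1.5 by default);
    this rule is an assumption, not something stated in the notes.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme points that are NOT outliers.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return min(inside), max(inside), outliers

# Hypothetical data with one unusually large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
low_whisker, high_whisker, outliers = modified_boxplot_parts(values)
print(low_whisker, high_whisker, outliers)
```

With this data the upper whisker stops at 8, and 30 would be drawn as a separate dot.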

Purpose of Boxplots

  • Easy to compare two or more sets of data
  • The points plotted individually (outliers) are still part of the data set and cannot be ignored

Problem with boxplots: they do not show the full shape of the distribution, and in particular they hide the number of modes. Skewness is partly visible but hard to judge. Histograms show shape better.

Standard Deviation

  • A measure of the spread of the observations from the mean.
  • Roughly, the average distance of the observations from the mean

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Here \(n\) is the sample size and \(x_i - \bar{x}\) is the deviation of \(x_i\) from the sample mean \(\bar{x}\).

  • The deviations are squared rather than taken in absolute value (absolute deviations are used in some business applications) because squaring has nice theoretical properties.
  • \(n-1\) is the degrees of freedom.

The variance, \(s^2\), is the square of the standard deviation; it also has nice theoretical properties.
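
The formula can be followed step by step in code. This is a minimal sketch with a hypothetical helper `sample_sd` and made-up data; it is compared against `statistics.stdev` from the Python standard library, which also divides by \(n-1\).

```python
import math
import statistics

def sample_sd(data):
    """Sample standard deviation, dividing by n - 1."""
    n = len(data)
    mean = sum(data) / n
    # Squared deviation of each observation from the sample mean.
    squared_devs = [(x - mean) ** 2 for x in data]
    # The variance is the sum of squared deviations over n - 1.
    variance = sum(squared_devs) / (n - 1)
    return math.sqrt(variance)

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_sd(data)
print(s, statistics.stdev(data))
```

Here the mean is 5 and the squared deviations sum to 32, so the variance is 32/7 and the standard deviation is its square root.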

Some Notes about Standard Deviation

  • \(s=0\) means that all the observations are the same. That means that there is no variability.
  • Like the mean, the standard deviation is also sensitive to extreme observations.
  • Use the mean and standard deviation for reasonably symmetric distributions, including bell-shaped distributions.
  • The five-number summary is better for skewed distributions or if there are outliers.
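
A small comparison may make these notes concrete. The three data sets below are hypothetical: one with identical observations, one typical set, and the same set with its largest value replaced by an extreme one.

```python
import statistics

# Hypothetical data sets, chosen only for illustration.
constant = [5, 5, 5, 5]
typical = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]

# Identical observations give a standard deviation of zero.
sd_constant = statistics.stdev(constant)

# One extreme value pulls the mean and inflates the standard deviation...
mean_typical = statistics.mean(typical)
mean_outlier = statistics.mean(with_outlier)
sd_typical = statistics.stdev(typical)
sd_outlier = statistics.stdev(with_outlier)

# ...while the median does not move.
median_typical = statistics.median(typical)
median_outlier = statistics.median(with_outlier)

print(sd_constant)
print(mean_typical, mean_outlier)
print(sd_typical, sd_outlier)
print(median_typical, median_outlier)
```

The median is the same for both sets, which is why the five-number summary is the safer description when outliers are present.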

Empirical Rule (68-95-99.7 rule)

For bell-shaped distributions, approximately:

  • 68% of values fall within 1 standard deviation of the mean in either direction.
  • 95% of values fall within 2 standard deviations of the mean in either direction.
  • 99.7% of values fall within 3 standard deviations of the mean in either direction.
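
As a worked illustration, the rule turns a mean and a standard deviation into three intervals. The sketch below uses a hypothetical function `empirical_rule_intervals` and made-up values (mean 100, standard deviation 15); the percentages are approximations that hold only for roughly bell-shaped data.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower, upper, approximate share of values)."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (mean - k * sd, mean + k * sd, share)
        for k, share in coverage.items()
    }

# Hypothetical mean and standard deviation.
intervals = empirical_rule_intervals(mean=100, sd=15)
for k, (low, high, share) in intervals.items():
    print(f"within {k} SD: {low} to {high} (about {share:.1%})")
```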

Standard Score or z-Score

The z-score is the distance between an observed value and the mean, measured in standard deviations:

$$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$$

Values that are above the mean have positive z-scores, and values that are below the mean have negative z-scores. This is a useful "yardstick" that says how far a value lies from the mean.
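
The z-score formula is a one-line computation. This is a minimal sketch; the function name `z_score` and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations that `observed` lies from `mean`."""
    if sd <= 0:
        # With sd = 0 every observation equals the mean, so z is undefined.
        raise ValueError("standard deviation must be positive")
    return (observed - mean) / sd

# Hypothetical values: one above the mean, one below.
z_above = z_score(130, mean=100, sd=15)
z_below = z_score(85, mean=100, sd=15)
print(z_above, z_below)
```

The value 130 lies two standard deviations above the mean, so its z-score is positive; 85 lies one standard deviation below, so its z-score is negative.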