Side-by-Side Boxplots, Standard Deviation, and Empirical Rule
Box Plots
A box plot is a graphical display of the five-number summary: the minimum, \(Q_1\), the median, \(Q_3\), and the maximum. It is drawn as follows:
- The middle line in the box is the median.
- The edges of the box are at \(Q_1\) and \(Q_3\).
- The whiskers extend from the box out to the minimum and maximum values.
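The five values a box plot is drawn from can be computed directly. The following is a minimal Python sketch, assuming the standard library's `statistics.quantiles` with its "inclusive" method. Textbooks and software use slightly different quartile conventions, so \(Q_1\) and \(Q_3\) may differ a little from a hand calculation. The function name and the data are made up for illustration.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # n=4 splits the data into quarters, giving the three cut points Q1, median, Q3.
    # The "inclusive" method is one of several quartile conventions (an assumption here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up data
print(five_number_summary(scores))      # (1, 3.0, 5.0, 7.0, 9)
```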
Modified Boxplots
A modified boxplot has the same box (\(Q_1\), median, and \(Q_3\)), but the whiskers are positioned differently:
- The whiskers extend only to the most extreme observations that are not outliers. Which points count as outliers depends on the rule used, so this is somewhat subjective.
- The outliers are shown as dots.
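One way to make the outlier decision concrete is sketched below. It assumes the common convention of flagging points that lie more than 1.5 times the interquartile range (\(Q_3 - Q_1\)) beyond the quartiles. These notes do not fix a particular rule, so the multiplier `k`, the function name, and the data are illustrative assumptions.

```python
import statistics

def modified_boxplot_parts(data, k=1.5):
    """Return (lower whisker, upper whisker, outliers) under a k * IQR rule."""
    # Quartiles via the "inclusive" convention (an assumption; conventions vary).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at the most extreme points that are NOT outliers.
    return min(inside), max(inside), outliers

values = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # made-up data with one extreme point
print(modified_boxplot_parts(values))    # (1, 8, [30])
```

The outlier 30 is still reported, not dropped: it would be drawn as a dot beyond the upper whisker.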
Purpose of Boxplots
- Easy to compare two or more sets of data
- The points plotted individually (outliers) are still part of the data set and cannot be ignored
Problem with boxplots: They do not reveal the full shape of the distribution, in particular the number of modes. Histograms show shape better. Skewness is partly visible in a boxplot, but it is hard to judge.
Standard Deviation
- A measure of the spread of the observations from the mean.
- Roughly, the average distance of the observations from the mean
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$
where \(n\) is the sample size, \(\bar{x}\) is the sample mean, and \(x_i - \bar{x}\) is the deviation of \(x_i\) from the sample mean.
- The deviations are squared rather than taken in absolute value (the absolute-value version is used in some business applications) because squaring has some nice theoretical properties.
- The divisor \(n-1\) is the degrees of freedom: the deviations always sum to zero, so only \(n-1\) of them are free to vary.
The variance, \(s^2\), is the square of the standard deviation and shares these nice theoretical properties.
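The formula can be followed step by step in code. This is a small Python sketch of the sample variance and standard deviation with the \(n-1\) divisor. The function names and the data are made up for illustration, and the result is compared with the standard library's `statistics.stdev`, which uses the same divisor.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_sd(data):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5
# Squared deviations: 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32, so the variance is 32 / 7.
print(sample_variance(data))
print(sample_sd(data), statistics.stdev(data))   # the two values should agree
```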
Some Notes about Standard Deviation
- \(s=0\) means that all the observations are the same. That means that there is no variability.
- Like the mean, the standard deviation is also sensitive to extreme observations.
- Use the mean and standard deviation for reasonably symmetric distributions, including bell-shaped distributions.
- The five-number summary is better for skewed distributions or if there are outliers.
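The sensitivity to extreme observations is easy to see numerically. The sketch below uses made-up data and a hypothetical helper `summarize`. It adds one extreme value to a small data set and compares how far the mean, standard deviation, and median move.

```python
import statistics

def summarize(data):
    """Mean, sample standard deviation, and median of a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),
        "median": statistics.median(data),
    }

base = [10, 12, 13, 15, 16]    # made-up, roughly symmetric data
with_extreme = base + [100]    # the same data plus one extreme observation

print(summarize(base))
print(summarize(with_extreme))
```

With the extreme value included, the mean and standard deviation change substantially while the median moves only from 13 to 14, which is why the five-number summary is preferred when outliers are present.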
Empirical Rule (68-95-99.7 rule)
For bell-shaped (approximately normal) distributions:
- About 68% of values fall within 1 standard deviation of the mean in either direction.
- About 95% of values fall within 2 standard deviations of the mean in either direction.
- About 99.7% of values fall within 3 standard deviations of the mean in either direction.
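The rule can be checked against simulated bell-shaped data. This is a rough Python sketch: the seed, the sample size, and the mean of 100 and standard deviation of 15 are arbitrary choices, and `fraction_within` is a hypothetical helper. The last printed column is the proportion for an exactly normal distribution, taken from the standard library's `NormalDist`.

```python
import random
from statistics import NormalDist, mean, stdev

def fraction_within(data, k):
    """Fraction of observations within k sample standard deviations of the sample mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

# Simulated bell-shaped data (all parameters here are arbitrary assumptions).
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(20_000)]

for k in (1, 2, 3):
    exact = NormalDist().cdf(k) - NormalDist().cdf(-k)
    print(k, round(fraction_within(sample, k), 3), round(exact, 4))
```

The simulated fractions should land close to 68%, 95%, and 99.7%, but not exactly on them, since the rule is an approximation and the sample is random. For strongly skewed data the fractions can differ noticeably.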
Standard Score or z-Score
The z-score is the distance between an observed value and the mean, measured in standard deviations:
$$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$$
Values that are above the mean have positive z-scores, and values that are below the mean have negative z-scores. This is a useful "yardstick" that says how far a value lies from the mean.
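As a quick illustration, the z-score formula translates directly into a small Python function. The function name and the numbers (a mean of 70 and a standard deviation of 10) are made up for the example.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations that `observed` lies from `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (observed - mean) / sd

print(z_score(85, 70, 10))   # 1.5  (above the mean, so positive)
print(z_score(55, 70, 10))   # -1.5 (below the mean, so negative)
```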