STATS 250

Relationships Between Categorical Variables

In the past few lectures, we looked at the relationship between two quantitative variables, in particular the linear relationship between them.

Now we're looking at categorical variables. There are three basic tests:

Goodness of fit test: This test is for assessing if a particular discrete model is a good fitting model for a discrete characteristic, based on a random sample from the population.
Test of Homogeneity: This test is for assessing if two or more populations are homogeneous (alike) with respect to the distribution of some discrete (categorical) variable.
Test of Independence: This test helps us assess if two discrete (categorical) variables are independent for a population or if there is an association between the two variables.

Chi Square Distribution

All three tests are based on a $$\chi^2$$ test statistic. If $$H_0$$ is true and the assumptions hold, the $$\chi^2$$ statistic approximately follows a chi-square distribution.

For a chi-square distribution with df degrees of freedom:

Mean is equal to degrees of freedom
Variance is equal to twice the degrees of freedom
Standard deviation is equal to the square root of the variance
Median is the point with 50% of the values above it (found with a chi-square table)
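
As a rough check on these properties, the sketch below simulates draws from a chi-square distribution using only the Python standard library. It relies on the fact that a chi-square variable with df degrees of freedom is a gamma variable with shape df/2 and scale 2. The function name `simulate_chi_square`, the choice df = 3, the number of draws, and the seed are all assumptions made for illustration, not part of the lecture.

```python
import math
import random
import statistics


def simulate_chi_square(df, n_draws=100_000, seed=0):
    """Draw n_draws values from a chi-square distribution with df degrees of freedom."""
    rng = random.Random(seed)
    # A chi-square(df) variable is a gamma variable with shape df/2 and scale 2.
    return [rng.gammavariate(df / 2, 2) for _ in range(n_draws)]


df = 3  # illustrative choice
draws = simulate_chi_square(df)

sample_mean = statistics.fmean(draws)
# Population-style variance of the simulated draws (divide by n).
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)
sample_sd = math.sqrt(sample_var)
sample_median = statistics.median(draws)

print(f"mean     ~ {sample_mean:.3f}  (theory: {df})")
print(f"variance ~ {sample_var:.3f}  (theory: {2 * df})")
print(f"sd       ~ {sample_sd:.3f}  (theory: {math.sqrt(2 * df):.3f})")
print(f"median   ~ {sample_median:.3f}  (compare with a chi-square table)")
```

The simulated mean and variance should land close to df and 2 × df. The median falls below the mean because the distribution is skewed to the right.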

Big Idea about Chi Square Tests

The question we're asking is "does this model seem to fit?" If a certain model holds, we expect a certain count in each category. We also have observed counts. How do these observed counts compare to what we expected?

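For each category, we take the difference between the observed and expected count (which would be 0 under a perfect fit), square it so that every contribution is positive, and rescale it by dividing by the expected count. The test statistic is the sum of these rescaled squared differences.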
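
The steps just described (difference, square, rescale, then add up) can be sketched in Python. The counts below are hypothetical and chosen only for illustration, and the variable names are not from the lecture.

```python
# Hypothetical counts for three categories (illustration only).
observed = [18, 22, 20]
expected = [20, 20, 20]

# Step 1: difference between observed and expected count in each category.
differences = [o - e for o, e in zip(observed, expected)]

# Step 2: square each difference so every contribution is non-negative.
squared = [d ** 2 for d in differences]

# Step 3: rescale each squared difference by its expected count.
contributions = [s / e for s, e in zip(squared, expected)]

# The test statistic is the sum of the per-category contributions.
chi_square = sum(contributions)

print("differences:  ", differences)
print("squared:      ", squared)
print("contributions:", contributions)
print(f"chi-square statistic: {chi_square:.2f}")
```

Keeping the per-category contributions in a list makes it easy to see which categories drive a large statistic.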

Test of Goodness of Fit

This is used to assess if a particular discrete model is a good fitting model for a discrete characteristic, based on a random sample from the population.

1 population
1 random sample
1 response which is categorical or discrete

Let's say you have four toll booths, and you see 100 cars come through in total. If the booths are used equally, you would expect 25 cars through each. Let's say instead you observed:

$$(26, 20, 28, 26)$$

The chi square value for this is:

$$\chi^2 = \sum_{\text{all cells}} \frac{\text{(Observed - Expected)}^2}{\text{Expected}} = \frac{(26 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(28 - 25)^2}{25} + \frac{(26 - 25)^2}{25} = 1.44$$

This means the observed test statistic is 1.44, with df = k - 1 = 4 - 1 = 3, where k is the number of categories.
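
The toll booth calculation can be reproduced with a short Python sketch. The function name `goodness_of_fit_statistic` and its parameters are hypothetical, introduced here for illustration. The model is assumed to be "each booth is equally likely", so each model probability is 0.25.

```python
def goodness_of_fit_statistic(observed, model_probs):
    """Return the chi-square statistic and df for a goodness of fit test.

    observed:    list of observed counts, one per category
    model_probs: list of category probabilities under the model (should sum to 1)
    """
    n = sum(observed)
    # Expected count in each category under the model.
    expected = [n * p for p in model_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # df = number of categories - 1
    df = len(observed) - 1
    return statistic, df


observed = [26, 20, 28, 26]             # cars seen at each of the four booths
model_probs = [0.25, 0.25, 0.25, 0.25]  # assumed model: booths used equally

statistic, df = goodness_of_fit_statistic(observed, model_probs)
print(f"chi-square = {statistic:.2f}, df = {df}")  # chi-square = 1.44, df = 3
```

Passing the model as probabilities rather than expected counts lets the same function handle models where the categories are not equally likely.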