Lecture 2: Distribution and Visualization of Data
Summaries of Data
Data can be described with numerical summaries and graphical summaries.
Common Numerical Summaries
- Frequency (i.e., count)
- Relative frequency (i.e., proportion from 0 to 1)
- Extrema (i.e., smallest/largest values)
- Mean, median, mode, standard deviation, and many more
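As a minimal sketch of the numerical summaries listed above, the following uses only Python's standard library. The sample values and variable names are hypothetical and chosen purely for illustration.

```python
from collections import Counter
import statistics

# Hypothetical sample: number of siblings reported by ten students.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]
n = len(data)

# Frequency: how many times each value occurs.
frequency = Counter(data)

# Relative frequency: each count divided by the sample size (a proportion in [0, 1]).
relative_frequency = {value: count / n for value, count in frequency.items()}

# Extrema: smallest and largest values.
smallest, largest = min(data), max(data)

# Center and spread.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
std_dev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

print("frequency:", dict(frequency))
print("relative frequency:", relative_frequency)
print("min, max:", smallest, largest)
print("mean, median, mode:", mean, median, mode)
print("standard deviation:", std_dev)
```

Note that `statistics.stdev` computes the sample standard deviation; `statistics.pstdev` would be the population version, and which one is appropriate depends on the context.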
Common Graphical Summaries
Note that when plotting quantitative data, the values can be grouped into intervals (bins) of the form \([a, b)\), where \(a\) is included in the interval and \(b\) is not.
- Histogram or bar chart (a histogram bins quantitative data; a bar chart shows categorical data)
- For categorical data, the order of categories does not matter
- Make sure to include a data table
- Pie chart
- Harder to compare the sizes of categories, especially when there are many of them
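The half-open intervals \([a, b)\) used for quantitative data can be illustrated with a short sketch. The function `bin_counts`, its parameters, and the sample scores are hypothetical; equal-width bins are an assumption made here for simplicity.

```python
def bin_counts(values, start, width, num_bins):
    """Count how many values fall in each half-open bin [a, b).

    Bin i covers [start + i * width, start + (i + 1) * width).
    Values outside all bins are ignored.
    """
    counts = [0] * num_bins
    for x in values:
        index = int((x - start) // width)  # floor division picks the bin
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts


# Hypothetical exam scores.
scores = [50, 55, 60, 60, 69.9, 70, 85, 90, 99]
counts = bin_counts(scores, start=50, width=10, num_bins=5)

# Text rendering of the histogram: one '#' per observation.
for i, count in enumerate(counts):
    a = 50 + 10 * i
    print(f"[{a}, {a + 10}): {'#' * count}")
```

A score of exactly 60 is counted in \([60, 70)\) rather than \([50, 60)\), because the left endpoint is included and the right endpoint is not.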
The distribution of a variable consists of the values the variable can take and the proportion of the time it takes each of those values.
A histogram tells us:
- Overall pattern
- Is it symmetric? Skewed? Bell-shaped? Uniform?
- Right-skewed data has most of its values on the left with a long tail to the right, and vice versa for left-skewed data
- The modality (how many peaks does the data have?)
- Location (center, average)
- 50% point (median) or balance point (mean)
- Variability
- Range (max - min)
- Standard deviation
- Variance
- Deviations (unusual features)
- Outliers
- Should not be discarded without justification
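The features a histogram reveals can also be checked numerically. The sketch below uses a hypothetical right-skewed sample. The 1.5 × IQR rule for flagging outliers is a common rule of thumb assumed here for illustration; it is not stated in the lecture.

```python
import statistics

# Hypothetical right-skewed sample with one unusually large value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Location: balance point (mean) and 50% point (median).
mean = statistics.mean(data)
median = statistics.median(data)

# Variability: range, variance, and standard deviation (sample versions).
data_range = max(data) - min(data)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Assumed rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("mean:", mean, "median:", median)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
print("flagged as possible outliers:", outliers)
```

Here the mean (4.7) is pulled above the median (3) by the long right tail, which is one numerical hint of right skew. The value 20 is flagged, but flagging is only a prompt to investigate, not a justification for discarding the point.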