STATS 250

Lecture 2: Distribution and Visualization of Data

Summaries of Data

Data can be described with numerical summaries and with graphical summaries.

Common Numerical Summaries

  • Frequency (i.e., count)
  • Relative frequency (i.e., proportion from 0 to 1)
  • Extrema (i.e., smallest/largest values)
  • Mean, median, mode, standard deviation, and many more
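As a minimal sketch, the numerical summaries listed above can be computed with Python's standard library. The data set below is hypothetical (it does not come from the lecture), and the standard deviation shown is the sample version, which divides by n - 1.

```python
from collections import Counter
import statistics

# Hypothetical sample: number of siblings reported by ten students.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]
n = len(data)

# Frequency: how many times each value occurs.
frequency = Counter(data)

# Relative frequency: each count divided by the sample size (a proportion in [0, 1]).
relative_frequency = {value: count / n for value, count in frequency.items()}

# Extrema: smallest and largest observed values.
smallest, largest = min(data), max(data)

# Measures of center and spread.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
std_dev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

print("frequency:", dict(frequency))
print("relative frequency:", relative_frequency)
print("min, max:", smallest, largest)
print("mean, median, mode:", mean, median, mode)
print("standard deviation:", round(std_dev, 3))
```

The relative frequencies always sum to 1, which is a quick check that none of the observations were dropped.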

Common Graphical Summaries

Note that when plotting quantitative data, the values can be grouped into intervals (bins) of the form \([a, b)\), where \(a\) is included in the interval and \(b\) is not.

  • Bar chart (for categorical data) or histogram (for quantitative data)
    • For categorical data, the order of categories does not matter
    • Make sure to include a data table
  • Pie chart
    • Harder to compare the sizes of categories than with a bar chart, especially when there are many categories
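The half-open intervals \([a, b)\) used for quantitative data can be illustrated with a short sketch. The function name `bin_counts`, its parameters, and the exam scores are all hypothetical choices made for illustration. The point to notice is that a value sitting exactly on a boundary, such as 60, is counted in the bin that starts at 60, not the one that ends there.

```python
import math

def bin_counts(values, start, width, num_bins):
    """Count how many values fall in each half-open bin [a, b).

    Bin i covers [start + i * width, start + (i + 1) * width).
    Values outside the covered range are ignored.
    """
    counts = [0] * num_bins
    for x in values:
        index = math.floor((x - start) / width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts

# Hypothetical exam scores.
scores = [50, 55, 60, 60, 69.5, 70, 79, 80, 89, 90]
counts = bin_counts(scores, start=50, width=10, num_bins=5)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    a = 50 + 10 * i
    b = a + 10
    print(f"[{a}, {b}): {'*' * count}")
```

Both scores of 60 land in \([60, 70)\) because the left endpoint is included and the right endpoint is excluded.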

Describing Distribution of Data

The distribution of a variable consists of the values the variable can take and the proportion of the time it takes each of those values.

A histogram tells us:

  • Overall pattern
    • Is it symmetric? Skewed? Bell-shaped? Uniform?
      • A right-skewed distribution has most of its data on the left, with a long tail extending to the right, and vice versa for left-skewed
    • The modality (how many peaks does the data have?)
  • Location (center, average)
    • 50% point (median) or balance point (mean)
  • Variability
    • Range (max - min)
    • Standard deviation
    • Variance
  • Deviations (unusual features)
    • Outliers
      • Should not be discarded without justification
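The features above (center, variability, and unusual values) can also be checked numerically, as in the sketch below. The data set is hypothetical, and the 1.5 × IQR rule used to flag outliers is an assumption brought in for illustration; it is one common convention and is not stated in these notes. Flagged values are only reported, not removed, in keeping with the point that outliers should not be discarded without justification.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]

# Location: the 50% point (median) and the balance point (mean).
median = statistics.median(data)
mean = statistics.mean(data)

# Variability: range, sample variance, and sample standard deviation.
data_range = max(data) - min(data)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Shape hint: a mean well above the median suggests a right skew.
skew_hint = "right" if mean > median else "left" if mean < median else "none"

# Outliers (assumed rule): flag values beyond 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("median, mean:", median, mean)
print("range, variance, std dev:", data_range, round(variance, 3), round(std_dev, 3))
print("skew hint:", skew_hint)
print("flagged outliers:", outliers)
```

Different software computes quartiles in slightly different ways, so the fences, and occasionally the flagged values, can vary between tools.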