Mr. Meinzen - AP Statistics Terminology

"Success is the ability to go from one failure to another with no loss of enthusiasm." Winston Churchill

Chapter 1 Terminology : Exploring Data

Fathom Software

  • Toolbar
  • Formula Editor
  • Data :
    • Collection
    • Case or Cases (i.e. individuals)
    • Attribute (i.e. variable)
    • Value
  • Viewing the Data:
    • Inspection Window
    • Table (or Case Table)
    • Graph (or Plot) - bar, pie, histogram, dotplot, stemplot
    • Summary Table or Chart

Data & Distributions : shape, center, spread

  • univariate data ("attributes" in Fathom software)

    • quantitative variable
    • categorical (or qualitative) variable
  • SHAPE of distribution

    • symmetric
      • uniform (or rectangular) distribution
      • normal distribution
    • skewness
      • left skewed distribution
      • right skewed distribution
    • other
      • bimodal distribution
  • CENTER [summary statistic: single value measurement or computation]

    • mean
    • median
    • mode (only for bimodal)
  • SPREAD [summary statistic: single or multiple valued measurement or computation]

    • deviation (or residual)
    • for means : standard deviation (s) or variance (s^2)
    • for medians : quartiles (ranges)
      • 5-number summary : min, Q1, Q2 (median), Q3, max
      • interquartile range (IQR)
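
As a rough illustration of the center and spread measures listed above, the following Python sketch computes them with the standard library. The scores are made-up values, and the "inclusive" quartile method is an assumption: textbooks, calculators, and Fathom may use slightly different quartile conventions and so report slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical quiz scores, invented only for this illustration.
scores = [40, 43, 44, 47, 50, 52, 58]

# CENTER
mean = statistics.mean(scores)
median = statistics.median(scores)

# SPREAD around the mean: sample standard deviation s and variance s^2
# (both divide by n - 1).
s = statistics.stdev(scores)
s_squared = statistics.variance(scores)

# SPREAD around the median: quartiles, 5-number summary, and IQR.
# The "inclusive" method is one of several quartile conventions (assumption).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
five_number_summary = (min(scores), q1, q2, q3, max(scores))
iqr = q3 - q1

print("mean:", round(mean, 2), "median:", median)
print("s:", round(s, 2), "s^2:", round(s_squared, 2))
print("5-number summary:", five_number_summary, "IQR:", iqr)
```

Note that Q2 from the quartile calculation is the same value as the median.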

Plots

  • boxplot (and modified boxplot) : outliers lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
  • symmetric versus skewed (right or left)
  • cumulative frequency graph (or "Ogive")
  • time plot
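
The 1.5*IQR outlier rule used by the modified boxplot can be sketched in a few lines of Python. The data set is hypothetical (the value 45 was planted as an unusual observation), and the "inclusive" quartile method is an assumption, since quartile conventions vary between tools.

```python
import statistics

# Hypothetical data; 45 is deliberately far from the rest.
data = [12, 15, 16, 17, 18, 19, 21, 22, 45]

# Quartiles (first and third); the middle value is the median and is not needed here.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences for the modified boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier.
outliers = [value for value in data if value < lower_fence or value > upper_fence]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```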

 

Linear Transformations : effects on mean, spread and change of units
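
A small sketch of how a linear transformation (new value = a + b * old value) affects the mean and the standard deviation. The temperatures are invented for illustration, and the Celsius-to-Fahrenheit change of units is only one possible example: adding a shifts the mean but not the spread, while multiplying by b rescales both.

```python
import statistics

# Hypothetical temperatures in degrees Celsius.
celsius = [18.0, 20.5, 22.0, 25.5, 30.0]

# Change of units: F = 32 + 1.8 * C, a linear transformation a + b*x.
a, b = 32.0, 1.8
fahrenheit = [a + b * c for c in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

# The mean is transformed like the data: a + b * mean.
print(mean_f, a + b * mean_c)

# The standard deviation is only rescaled by |b|; adding a has no effect.
print(sd_f, abs(b) * sd_c)
```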

 

 

Context Clues

  • statistical use of a "count" :
    • Mathematics : Generally, a "count" is a whole number (0, 1, 2, 3, ...). However, it may be a real-valued number if the item being "counted" is grouped in equal-sized units such as 12 bottles of soda in a case.
    • English : each numeric value must have units such as "oranges" or "individuals" or "cases of soda"
    • Examples : 20 oranges, 17 people, 22.5 cases of soda
  • statistical use of a "proportion" :
    • Concept : part / total (i.e. percentage)
    • Mathematics : real-valued ratio between 0 and 1 (inclusive)
    • English : same units divided by same units (i.e. unit-less quantity)
    • Example : In a class of 25 students with 10 girls and 15 boys, the proportion of girls in the classroom is 10/25 (i.e. 40%). NOTE: Technically, the units are "students per student" but this does not need to be specified.
  • statistical use of a "mean" :
    • Concept : the (arithmetic) average value of a set of numbers
    • Mathematics : ( Σ xi) / n = sum of numbers / count of numbers
    • English : units divided by a count (i.e. the mean has the same units as the numbers themselves)
    • Example : In a class of 5 students, the points on a given quiz are 43, 47, 50, 40, and 44. The mean is (43+47+50+40+44) / 5 = 44.8 points/student. NOTE: The units are "points per student" and must be specified.
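
The proportion and mean examples above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
# Proportion: part / total, a unit-less value between 0 and 1.
girls, boys = 10, 15
proportion_girls = girls / (girls + boys)

# Mean: sum of the numbers divided by the count of the numbers.
quiz_points = [43, 47, 50, 40, 44]
mean_points = sum(quiz_points) / len(quiz_points)

print("proportion of girls:", proportion_girls)      # 10/25
print("mean quiz score:", mean_points, "points per student")
```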

Chapter 3 Terminology : Relationships

 

Chapters 1 & 2: One Variable : univariate data : shape -> center -> spread

Chapter 3: Two Variables : bivariate data : shape -> trend -> strength -> variability

  • Key Idea
    • One Variable : Distribution
    • Two Variables : Relationship (association) : x = explanatory variable = independent variable; y = response variable = dependent variable
  • Plots/Graphs
    • One Variable : Dot plot, Stemplot, Boxplot, Histogram
    • Two Variables : Scatterplot
  • Shape
    • One Variable : Normal, uniform, or skewed; Symmetric; Clusters, gaps, and outliers
    • Two Variables : Linear or curved; constant strength; clusters, gaps, outliers (in one variable), influential point (either or both variables)
  • Ideal Shape
    • One Variable : Normal
    • Two Variables : Linear (oval/ellipse) : positive or negative direction
  • Measure of Center
    • One Variable : Mean; Median
    • Two Variables : Regression Line (LSRL); influential point
  • Measure of Spread from the Center
    • One Variable : Standard Deviation; Interquartile Range
    • Two Variables : Correlation, r

  • bivariate data (from Chapter 3): shape -> trend -> strength -> variability

    • plausible explanation : causation, common response, or confounding
    • lurking variable
    • residual plots
    • outliers
  • scatter plots : shape -> trend -> strength -> variability

    1. data's shape : linear, curved, or none
      • y = a1 + b1*x [equivalent to algebra equation of line : y = mx + b]
    2. shape's trend : positive slope, negative slope, or none
      • b1 : measure of slope
    3. trend's strength : strong trend (tight cluster), moderate trend (some clustering), or weak trend (no cluster)
      • correlation : a measure of a trend's strength
    4. strength's variability : uniform (homoscedastic) or fan-shaped (heteroscedastic)
  • summary line

    • Least Squares Regression Line (also called LSRL)
    • Line of Best Fit (best guess or LSRL)
    • Regression Line (LSRL)
    • Trend Line (LSRL)
    • Fitted Line (best guess)
  • statistical (math or calculator) symbols and terms

    • randomNormal ( mean, SD )
    • least squares regression line ("regression line" or just LSRL) : y = a1 + b1*x
    • explanatory variable or predictor variable, x
    • response variable or observed variable, y
    • slope, b1 = r* (sy/sx)
    • y-intercept, a1
    • predicted value, ŷ
    • interpolation, extrapolation
    • influential point (outlier)
    • residual = y-ŷ = observed y - predicted y
    • sum of squared errors (SSE)
    • r : correlation coefficient
    • r^2 : coefficient of determination
    • "correlation does not imply causation" due to lurking variable
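
To connect the regression symbols above, here is a minimal Python sketch that computes r, the slope b1 = r*(sy/sx), the residuals, SSE, and r^2 for a small data set. The data are hypothetical, and the intercept formula a1 = y-bar - b1 * x-bar is not stated in the list above; it is the standard least-squares result and is used here as an assumption.

```python
import statistics

# Hypothetical (x, y) data, invented only for this illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Correlation r: sum of products of standardized values, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# LSRL: y-hat = a1 + b1 * x
b1 = r * (s_y / s_x)          # slope
a1 = y_bar - b1 * x_bar       # y-intercept (the line passes through the means)

# Predicted values, residuals (observed y - predicted y), and SSE.
y_hat = [a1 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
sse = sum(e ** 2 for e in residuals)

# Coefficient of determination: fraction of the variation in y explained by the line.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / total_ss

print("r:", round(r, 4), "b1:", round(b1, 4), "a1:", round(a1, 4))
print("SSE:", round(sse, 4), "r^2:", round(r_squared, 4))
```

For a least-squares line the residuals sum to zero (up to rounding) and r^2 computed from SSE matches the square of r.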

Chapter 4 Terminology : Two-Variable Data

  • transformations

    • linear : y = a + bt
      • increase is fixed amount from previous values
    • exponential growth model : y = a*b^t
      • increase is fixed percentage from previous values
      • plot "log y against x" to see linear pattern
    • power transformations : y = a*t^b
      • use "log y against log x" graph to see linear pattern
      • square, p=2, y = t^2
      • reciprocal, p=-1, y = 1/t
      • reciprocal square root, p=-1/2, y = t^(-1/2)
      • logarithm, p=0, y = log(t)
  • correlation and regression interpretation

    • "correlation does not imply causation" due to lurking variable
    • using LSRL for interpolation and extrapolation
    • causation, common response, confounding
    • categorical data (two-way tables and bar graphs)
      • for counts and percentages within categories
      • marginal distributions (single categorical variable)
      • conditional distributions
      • Simpson's Paradox (lurking variable as a pre-condition)
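
Returning to the transformations listed at the start of this chapter, the sketch below shows why the log plots straighten the data: for an exponential model, log y against t has a constant slope of log(b); for a power model, log y against log t has a constant slope equal to the power. The constants a, b, and p are hypothetical values chosen for illustration.

```python
import math

t = [1, 2, 3, 4, 5, 6]   # starts at 1 so that log(t) is defined

# Exponential growth model y = a * b^t (hypothetical a and b).
a, b = 3.0, 1.5
y_exp = [a * b ** ti for ti in t]
log_y_exp = [math.log10(v) for v in y_exp]

# Slope of "log y against t" between neighbouring points.
exp_slopes = [(log_y_exp[i + 1] - log_y_exp[i]) / (t[i + 1] - t[i])
              for i in range(len(t) - 1)]

# Power model y = a * t^p (hypothetical power p).
p = 2.0
y_pow = [a * ti ** p for ti in t]
log_t = [math.log10(ti) for ti in t]
log_y_pow = [math.log10(v) for v in y_pow]

# Slope of "log y against log t" between neighbouring points.
pow_slopes = [(log_y_pow[i + 1] - log_y_pow[i]) / (log_t[i + 1] - log_t[i])
              for i in range(len(t) - 1)]

print("exponential slopes:", exp_slopes, "log10(b) =", math.log10(b))
print("power slopes:", pow_slopes, "p =", p)
```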

Chapter 5 Terminology : Producing & Exploring Data

  • units (and population size)
  • population census versus sample of population
  • population parameter versus sample statistic
  • samples - unbiased representation of population [random and independent]

    • simple random sample (SRS) : each possible sample of a given size has an equal probability of being selected. Uses a table of random digits
    • stratified random sample : groups (strata) of similar individuals, SRS within each stratum
    • cluster sample
    • two-stage (or multi-stage) cluster sample
    • systematic sample with random start
  • sample bias

    • selection bias
    • size bias
    • volunteer bias
    • convenience bias
    • judgement bias
    • undercoverage : certain groups left out of process of choosing sample
    • nonresponse bias
    • questionnaire bias
    • wording or language bias
    • incorrect response bias
  • Experiment vs Observational Study

    • designs of experiments :
      • experimental units->subjects->treatment (factor vs placebo)->observed response
      • 1. control; 2. randomize; 3. replicate
      • blind experiment, double blind experiment
      • completely randomized design
      • randomized matched paired design (2 treatments)
      • randomized block design
    • variables in experiments
      • explanatory variable or factor (if categorical)
      • response variable
      • lurking variable
      • confounding variable
    • variability
      • between-treatment
      • within-treatment
    • statistically significant
    • lack of realism
  • Simulation of Experiments : imitation of chance behavior based on an accurate model of the experiment

    1. Describe the experiment or state problem
    2. State assumptions
    3. Assign digits to represent outcomes (random number table)
    4. Simulate many repetitions
    5. State conclusions
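
The five simulation steps above can be sketched in Python, with a pseudo-random number generator standing in for the table of random digits. The scenario (a player who makes 70% of free throws) and all numbers are hypothetical, chosen only to show the steps.

```python
import random

# 1. Problem: a hypothetical player makes 70% of free throws.
#    Estimate the chance of making at least 8 out of 10 shots.
# 2. Assumptions: shots are independent, each with the same 70% chance.
# 3. Assign digits: 0-6 represent a made shot, 7-9 a missed shot.
SUCCESS_DIGITS = set(range(7))
SHOTS_PER_REPETITION = 10
REPETITIONS = 10_000

rng = random.Random(2024)   # fixed seed so the run can be repeated


def made_shots(generator):
    """Draw one random digit per shot and count the made shots."""
    digits = [generator.randrange(10) for _ in range(SHOTS_PER_REPETITION)]
    return sum(digit in SUCCESS_DIGITS for digit in digits)


# 4. Simulate many repetitions.
hits = sum(made_shots(rng) >= 8 for _ in range(REPETITIONS))

# 5. State conclusions: the relative frequency estimates the probability.
estimate = hits / REPETITIONS
print("estimated P(at least 8 of 10):", estimate)
```

With more repetitions the estimate settles closer to the theoretical binomial probability, which is about 0.38 for this scenario.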

Chapter 6 Terminology : Probability : a study of randomness

  • event
  • table of random digits
  • Venn diagrams
  • Probability

    • event, P(A)
    • complement of event, P(A^c)
    • distribution
    • model
    • mutually exclusive [disjoint categories]
    • conditional events
    • independent events
    • "of at least one"
    • variance & standard deviation
  • Sampling

    • space
    • with Replacement
    • without Replacement
  • Mathematical

    • Law of Large Numbers
    • Fundamental Counting Principle
    • Terminology with Probabilities
      • Legal Values : 0 ≤ P(A) ≤ 1 for any Event A
      • Total Probability = 1 : P(entire sample space) = 1 : the probabilities of all outcomes in the sample space sum to 1.00 (100%)
      • Complement of Event : P(A^c) = 1 - P(A)
        • P("of at least one") = 1 - P("exactly none")
      • Mutually Exclusive Events [i.e. Disjoint] : P(A and B) = 0
      • Independent Events : P(A|B) = P(A); also P(A and B) = P(A)*P(B)
      • Conditional Events : P(A|B) = P(A and B) / P(B) often re-written as P(A and B) = P(A|B) * P(B)
    • Addition Rules (aka Union) ["or"]
      • Full Rule : P(A or B) = P(A) + P(B) - P(A and B)
      • Simplified Rule for Mutually Exclusive Events [Disjoint] : P(A or B) = P(A) + P(B)
    • Multiplication Rules (aka Intersection) ["and"]
      • Full Rule : P(A and B) = P(A) * P(B|A) can also be re-written as P(A and B) = P(B) * P(A|B)
      • Simplified Rule for Independent Events : P(A and B) = P(A)*P(B)
    • Bayes' Rule
    • Tree Diagrams
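
The probability rules above can be verified on a small sample space. The sketch below uses one roll of a fair six-sided die and exact fractions; the events A, B, and C and the helper `prob` are hypothetical names chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3
C = {1}         # roll is a one (disjoint from A)

# Complement rule: P(A^c) = 1 - P(A)
p_not_a = prob(sample_space - A)

# Addition rule (full): P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Addition rule (disjoint events): P(A or C) = P(A) + P(C)
p_a_or_c = prob(A) + prob(C)

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Independence check: A and B are independent only if P(A|B) = P(A)
independent = p_a_given_b == prob(A)

# "At least one": P(at least one six in 4 rolls) = 1 - P(no six in 4 rolls)
p_at_least_one_six = 1 - Fraction(5, 6) ** 4

print(p_not_a, p_a_or_b, p_a_or_c, p_a_given_b, independent, p_at_least_one_six)
```

Here A and B are not independent, because knowing the roll is greater than 3 changes the chance that it is even.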

Chapter 7 Terminology : Random Variables : numeric outcome of a random phenomenon

  • Random Variable, X
    • Law of Large Numbers
    • Discrete (countable values for X, typically integers)
      • P(X=a) : probability of the random variable being a fixed value; height of a probability histogram bar
    • Continuous (any real value for X)
      • P(X<a) : probability of the random variable being less than (or equal to) fixed value; area under the curve from left side to a [reading left to right]
      • P(X>a) : probability of the random variable being greater than (or equal to) fixed value; area under the curve from a to right side [reading left to right]
      • P(a<X<b) : probability of the random variable being between (or equal to) two fixed values; area under the curve between a and b.
  • Distributions
    • Probability Distributions for a Random Variable, X
      • mean, ux : sometimes misleadingly called the Expected Value, E(X)
      • standard deviation, σx
      • variance, σx^2 : correlation for independent random variables = 0
      • linear combinations of random variables : u(x+y) = ux + uy; σ(x+y)^2 = σx^2 + σy^2 (the variances add only when X and Y are independent)
      • from Collected Data
        • using Known Data Frequencies that model Your Situation
        • simulation using Random selection from a known data
      • from Theory
        • assumptions + Basic Mathematical Principles
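
As a sketch of a probability distribution built "from theory", the code below computes the mean and variance of one fair die and then checks the rules for the sum of two independent dice. The helper names `mean` and `variance` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Probability distribution of one fair die: value -> probability.
die = {value: Fraction(1, 6) for value in range(1, 7)}


def mean(dist):
    """Mean of a discrete random variable: sum of value * probability."""
    return sum(value * p for value, p in dist.items())


def variance(dist):
    """Variance: sum of (value - mean)^2 * probability."""
    m = mean(dist)
    return sum((value - m) ** 2 * p for value, p in dist.items())


# Distribution of X + Y for two independent dice: multiply the probabilities.
total = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    total[x + y] = total.get(x + y, 0) + px * py

print("one die:  mean", mean(die), "variance", variance(die))
print("two dice: mean", mean(total), "variance", variance(total))
```

The mean of the sum equals the sum of the means, and because the dice are independent the variances add as well.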

Chapter 8 Terminology : Binomial and Geometric Distributions

  • Binomial Distributions : Discrete Random Variable with only 2 categories (i.e. "success" or "failure")
    • n = fixed number of independent trials, must be known
    • p = probability of success on any one trial, must be the same for each trial, must be known
    • 1-p = q = probability of failure on any one trial
    • P(X = k) = nCk * p^k * (1 - p)^(n - k)
    • P(X > k) = 1 - P(X ≤ k)
    • probability distribution function at X="number of success":
      • binompdf(number of trials, probability of success, number of successes)
    • cumulative (i.e. sum of ) probability distribution function for 0 ≤ X ≤ "number of successes" = area under curve
      • binomcdf(number of trials, probability of success, number of successes)
  • Statistics of a Binomial Distribution
    • u = np
    • σ = √[np(1-p)]
    • Normal approximation to Binomial Distribution ~ BINS (binomial, independent, number of trials is fixed, success probability is known)
      • N(np, √[np(1-p)] ) : used only if np ≥ 10 and n(1-p) ≥ 10
  • Geometric Distributions : Discrete Random Variable with only 2 categories (i.e. "success" or "failure")
    • n = number of independent trials (not fixed)... we are trying to determine the number of trials it takes to get our first "success"!
    • p = probability of success on any one trial, must be the same for each trial, must be known
    • 1-p = q = probability of failure on any one trial
    • P(X = n) = (1-p)^(n-1) * p
    • P(X > n) = 1 - P(X≤n)
    • probability distribution function at X="trial number of the first success":
      • geometpdf(probability of success, trial number of the first success)
    • cumulative (i.e. sum of ) probability distribution function for 1 ≤ X ≤ "trial number of the first success" = area under curve
      • geometcdf(probability of success, trial number of the first success)
  • Statistics of a Geometric Distribution
    • u = 1/p
    • σ = √(1-p) / p
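
The binomial and geometric formulas above can be written as small Python functions. The function names below are borrowed from the calculator commands for readability, but they are hypothetical re-implementations in Python, not the calculator's own code; the example values of n and p are made up.

```python
import math


def binompdf(n, p, k):
    """P(X = k) for a binomial distribution with n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomcdf(n, p, k):
    """P(X <= k): sum of the binomial probabilities for 0, 1, ..., k."""
    return sum(binompdf(n, p, i) for i in range(k + 1))


def geometpdf(p, n):
    """P(first success occurs on trial n) for a geometric distribution."""
    return (1 - p) ** (n - 1) * p


def geometcdf(p, n):
    """P(first success occurs on or before trial n)."""
    return sum(geometpdf(p, i) for i in range(1, n + 1))


# Hypothetical example: n = 10 trials with p = 0.5.
n, p = 10, 0.5
binomial_mean = sum(k * binompdf(n, p, k) for k in range(n + 1))

print("P(X = 5):", binompdf(n, p, 5))
print("P(X > 5):", 1 - binomcdf(n, p, 5))
print("binomial mean:", binomial_mean, "np:", n * p)
print("P(first success within 5 trials, p = 0.2):", geometcdf(0.2, 5))
```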

Chapter 2 Terminology : Normal Distributions

  • probability density curve of population (all possible) data

    • total area under pd curve = 1
    • visually determine : median (equal area point), mean (balance point)
    • skewed : mean further toward tail than median
    • percentages calculated from probability density curve
  • normal probability curve (i.e. bell curve) from population data
    • Generally :
      • u : population mean [also the formula to calculate]
      • σ : population standard deviation, [also the formula to calculate]
      • N(u, σ ) : Normal (bell) curve [aka Probability Density Curve]
    • Standardized :
      • z-score as a standard point on normal curve [also the formula to calculate as well as re-centering and re-scaling]
      • N(0,1) is the Standard Normal curve (i.e. Z-curve)
    • 68-95-99.7 rule
    • Determine normality from plot of histogram, stemplot and/or boxplot
  • statistical (math or calculator) symbols and terms from sample data

    • x-bar : sample mean as an estimate for ux [also the formula to calculate]
    • s : sample standard deviation as an estimate for σx [also the formula to calculate]
    • normalcdf ( leftbound, rightbound, mean, standard deviation)
    • z = invNorm ( area, mean, SD )
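
The calculator commands above have close counterparts in Python's standard library. The sketch below wraps `statistics.NormalDist` in helper functions; the names `z_score`, `normalcdf`, and `inv_norm` are hypothetical, chosen to echo the calculator commands, and the example mean and standard deviation are made up.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Standardize x: re-center by the mean and re-scale by the SD."""
    return (x - mean) / sd


def normalcdf(left, right, mean=0.0, sd=1.0):
    """Area under the N(mean, sd) curve between left and right."""
    dist = NormalDist(mean, sd)
    return dist.cdf(right) - dist.cdf(left)


def inv_norm(area, mean=0.0, sd=1.0):
    """Value with the given area to its left under the N(mean, sd) curve."""
    return NormalDist(mean, sd).inv_cdf(area)


# 68-95-99.7 rule on the standard Normal curve N(0, 1).
for k in (1, 2, 3):
    print("within", k, "SD:", round(normalcdf(-k, k), 4))

# Hypothetical example: scores that follow N(500, 100).
print("z-score of 650:", z_score(650, 500, 100))
print("P(400 < X < 650):", round(normalcdf(400, 650, 500, 100), 4))
print("90th percentile:", round(inv_norm(0.90, 500, 100), 1))
```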