Google doc Cheat sheet with stuff not on Formula Sheet here
From Simple Studies, **https:%%//%%simplestudies.edublogs.org** & @simplestudies4 on Instagram
Statistics: The science of data
Data always involves individuals and variables
There are two varieties of variables:
Distribution tells us what values a variable takes and how frequently it takes these values
How to go from Data Analysis to Inference:
A Two-way Table describes two categorical variables, organizing counts according to a row variable and a column variable
Source:
https:%%//%%www.statology.org/conditional-relative-frequency-two-way-table/
The Marginal Distribution of one of the categorical variables is the distribution of values of that variable among all individuals described by the table
● Ex: Marginal distribution of gender: Male: 48/100 = 48% Female: 52/100 = 52%
These are the steps to take to examine a marginal distribution:
A Conditional Distribution of a variable describes the values of that variable among individuals who have a particular value of another variable
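A minimal Python sketch of both ideas, assuming a made-up two-way table (the sport categories and the cell counts are illustrative only; the row totals are chosen to match the 48/52 gender split in the example above):

```python
# Hypothetical counts: rows are gender, columns are an invented "favorite sport" variable.
table = {
    "Male":   {"Baseball": 13, "Basketball": 15, "Football": 20},
    "Female": {"Baseball": 23, "Basketball": 16, "Football": 13},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of gender: each row total divided by the grand total.
marginal_gender = {g: sum(row.values()) / grand_total for g, row in table.items()}

# Conditional distribution of sport among males: each male count divided by the male total.
male_total = sum(table["Male"].values())
sport_given_male = {s: c / male_total for s, c in table["Male"].items()}

print(marginal_gender)
print(sport_given_male)
```

The marginal distribution uses the whole table as its denominator, while the conditional distribution uses only one row (or column) total.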
Here are the steps to take to examine or compare conditional distributions:
When describing distribution of quantitative data, we use the acronym SOCCS
Stem-and-Leaf Plots are a simple graphical display for small sets of data
Source:
https:%%//%%en.wikipedia.org/wiki/Stem-and-leaf_display
These are the steps on how to make a Stem-and-Leaf Plot:
Histograms are graphs that display the distribution of a quantitative variable by showing each interval of the values as a bar
Source: https:%%//%%online.stat.psu.edu/stat500/book/export/html/539
These are the steps to take on how to construct a histogram:
The median is the midpoint of the distribution
These are the steps to take to find the median:
The mean is the average of all individual data values
These are some observations you should look at to determine if you should use the mean or median to measure the center of a distribution of data:
These are the steps to take to calculate quartiles:
The standard deviation measures the typical distance between each value and the mean
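A short Python sketch, using made-up numbers, of how one extreme value pulls the mean but barely moves the median, and of the sample standard deviation computed with the n - 1 divisor. The helper name `sample_sd` is illustrative:

```python
from math import sqrt
from statistics import mean, median, stdev

data = [2, 3, 4, 5, 6]          # made-up values
with_outlier = data + [60]      # same values plus one extreme observation

def sample_sd(values):
    """Square root of the sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))

print(mean(data), median(data), sample_sd(data))
print(mean(with_outlier), median(with_outlier))
```

Because the mean is not resistant to outliers, the median is usually the safer measure of center for skewed data.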
A five-number summary is a quick summary of the distribution of a data set
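One way to compute a five-number summary in Python, assuming the median-of-halves rule for quartiles (the median itself is left out of both halves when n is odd). The data and the function name are illustrative:

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]             # observations to the left of the median
    upper = values[half + n % 2:]     # observations to the right (skips the middle value if n is odd)
    return values[0], median(lower), median(values), median(upper), values[-1]

scores = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]   # made-up data
summary = five_number_summary(scores)
iqr = summary[3] - summary[1]                     # interquartile range: Q3 - Q1
print(summary, iqr)
```

Software packages often use other quartile rules, so results can differ slightly from a hand calculation.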
Source: https:%%//%%www.simplypsychology.org/boxplots.html
Percentile: The nth percentile of a distribution is the value with n percent of the observations less than it
Adding or subtracting the same number n to each observation:
The z-score tells us how many standard deviations away from the mean an observation falls, and what direction it falls in
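As a quick illustration in Python (the test score, mean, and standard deviation are made-up numbers):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_high = z_score(86, mean=80, sd=4)   # observation above the mean
z_low = z_score(72, mean=80, sd=4)    # observation below the mean
print(z_high, z_low)
```

A positive z-score means the observation is above the mean, and a negative z-score means it is below.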
When data has a regular overall pattern, we can use a simplified model called a density curve to describe it
Normal distributions are often shown in Normal curves
A normal curve is described by its mean and standard deviation
The mean of a normal distribution is at the center of the normal curve
It is the same as the median
The standard deviation is the distance from the center to the change-of-curvature points on either side
Source:
http:%%//%%www.stat.yale.edu/Courses/1997-98/101/normal.htm
The Empirical Rule: In the normal distribution with mean μ and standard deviation σ:
Source: http://stevegallik.org/cellbiologyolm_statistics.html
The Standard Normal Distribution is the normal distribution with mean 0 and standard deviation 1
Source: https://statistics-made-easy.com/standard-normal-distribution/
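The Empirical Rule (about 68%, 95%, and 99.7% of observations within 1, 2, and 3 standard deviations of the mean) can be checked against the standard normal distribution. This sketch assumes Python 3.8 or later for `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Proportion of observations within k standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")
```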
We use Table A to find the proportion of observations in a standard normal distribution that satisfies each z-score:
We can also use the calculator to find the proportion of observations in a standard normal distribution that satisfies each z-score:
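Outside of Table A and the calculator, the same areas can be found in Python. This is a sketch with made-up z-scores, again assuming `statistics.NormalDist` is available:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)

below = Z.cdf(1.5)                 # proportion with z < 1.5 (what Table A reports)
above = 1 - Z.cdf(1.5)             # proportion with z > 1.5
between = Z.cdf(2) - Z.cdf(-1)     # proportion with -1 < z < 2

print(round(below, 4), round(above, 4), round(between, 4))
```

Table A gives the area to the left of a z-score, so areas to the right or between two values come from subtraction.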
A normal probability plot provides a good assessment of the adequacy of the normal model for a set of data
Source:
https:%%//%%mathcracker.com/normal-probability-plot-maker
When analyzing two or more variables, there are two types you should keep in mind:
When examining the relationship between variables, these steps should be taken:
Source:
https:%%//%%www.mathsisfun.com/data/scatter-xy-plots.html
For a linear association between two quantitative variables, the correlation (r) measures both the direction and strength of the association
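A sketch of the correlation computed as the average product of standardized x and y values, with an n - 1 divisor. The data points and the function name are made up for illustration:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of the z-scores of x and y, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]     # made-up explanatory values
ys = [2, 4, 5, 4, 5]     # made-up response values
r = correlation(xs, ys)
print(round(r, 4))
```

The value of r always falls between -1 and 1, and it has no units because it is built from z-scores.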
A regression line displays the relationship between two variables, but only when one of the variables helps explain or predict the other
Source: https:%%//%%learningstatisticswithr.com/book/regression.html
A regression line relating y to x has the equation ŷ = a + bx
The Coefficient of Determination (r²) measures the percent of the variability in the response variable that is accounted for by the least-squares regression line
A residual is the difference between the actual value of y and the predicted value of y by the regression line
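The pieces above (the line ŷ = a + bx, residuals, and the coefficient of determination) can be tied together in a short Python sketch. The data are made up, and the function name is illustrative:

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the line passes through (x-bar, y-bar)
    return a, b

xs = [1, 2, 3, 4, 5]           # made-up data
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)

# Residual = actual y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sse = sum(e ** 2 for e in residuals)               # variation left over after fitting the line
sst = sum((y - mean(ys)) ** 2 for y in ys)         # total variation in y
r_squared = 1 - sse / sst
print(a, b, r_squared)
```

The residuals from a least-squares line always sum to zero (up to rounding), which is a useful check on hand calculations.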
Source:
https:%%//%%www.statisticshowto.com/least-squares-regression-line/
Residual Plot: A scatter plot that displays the residuals on the vertical axis and the explanatory variable on the horizontal axis
Source: https:%%//%%opexresources.com/analysis-residuals-explained/
Here are some vocabulary terms regarding sampling and surveys:
These are the different types of sampling designs:
These are the different types of bias:
Observational studies of the effect of one variable on another often fail because of these reasons:
These are some vocabulary terms that deal with experiments:
The three principles of experimental design are:
Probability: The probability of any outcome of a chance process is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions
Law of Large Numbers: If we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches its probability
Probability Model: A description of some chance process that consists of two parts: a list of all possible outcomes and the probability for each outcome.
If all outcomes in the sample space are equally likely, the probability that event A occurs can be found using this formula:
Basic Rules of Probability:
Two events are mutually exclusive if they have no outcomes in common and can never occur together
If A and B are any two events resulting from some chance process, the general addition rule says that:
Intersection: The event “A and B” is called the intersection of events A and B
Union: The event “A or B” is called the union of events A and B
Conditional Probability: The probability that one event happens given that another event is known to have happened is called a conditional probability
Independent: Two events are independent if the occurrence of one event has no effect on the chance that the other will happen
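A small Python sketch of intersection, union, conditional probability, and an independence check, assuming one roll of a fair six-sided die. The events A and B are invented for illustration:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}    # equally likely outcomes of one fair die roll
A = {2, 4, 6}                        # event: the roll is even
B = {4, 5, 6}                        # event: the roll is greater than 3

def prob(event):
    """P(event) when all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a_and_b = prob(A & B)                       # intersection: both A and B occur
p_a_or_b = prob(A | B)                        # union: A or B (or both) occurs
p_a_given_b = p_a_and_b / prob(B)             # conditional probability P(A | B)
independent = (p_a_given_b == prob(A))        # independent only if knowing B does not change P(A)

print(p_a_and_b, p_a_or_b, p_a_given_b, independent)
```

Here P(A or B) equals P(A) + P(B) - P(A and B), and since P(A | B) differs from P(A), these two events are not independent.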
General Multiplication Rule: For any chance process, the probability that events A and B both occur can be found using the general multiplication rule:
Tree Diagram: Shows the sample space of a chance process involving multiple stages
Source:
https:%%//%%www.onlinemathlearning.com/probability-tree-diagrams.html
If A and B are independent events, the probability that A and B both occur is:
Discrete Random Variable: Takes a fixed set of possible values with gaps between them
Continuous Random Variable: Can take any value in an interval on the number line
For any two random variables X and Y, if S = X + Y, the mean of S is:
For any two random variables X and Y, if D = X - Y, the mean of D is:
For any two independent random variables X and Y, if S = X + Y, the variance of S is:
For any two independent random variables X and Y, if D = X - Y, the variance of D is:
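For independent random variables, means add or subtract while variances add in both cases. This can be verified exactly by listing every outcome; the sketch below assumes X and Y are two independent rolls of a fair die, and the helper name is illustrative:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_var(values):
    """Mean and variance of a list of equally likely values."""
    values = list(values)
    n = len(values)
    mu = Fraction(sum(values), n)
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

mu_x, var_x = mean_var(faces)                       # one die
pairs = list(product(faces, faces))                 # 36 equally likely (x, y) outcomes
mu_s, var_s = mean_var(x + y for x, y in pairs)     # S = X + Y
mu_d, var_d = mean_var(x - y for x, y in pairs)     # D = X - Y

print(mu_s, var_s, mu_d, var_d)
```

The variance of the difference is the sum of the variances, not the difference, because subtracting a random quantity still adds variability.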
A binomial setting arises when we perform n independent trials of the same chance process and count the number of times that a particular outcome (a success) occurs. It must pass these conditions:
The variable X = the number of successes is called a binomial random variable. To find the probability of exactly k successes: binompdf(n, p, k)
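For checking calculator answers, here is a Python sketch of the binomial probability formula. The function names simply mirror the calculator commands, and the example values are made up:

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k): probability of at most k successes."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

exactly_3 = binompdf(10, 0.5, 3)     # exactly 3 heads in 10 fair coin flips
at_most_3 = binomcdf(10, 0.5, 3)     # 3 or fewer heads
print(exactly_3, at_most_3)
```

For "at least k" questions, use the complement: 1 - binomcdf(n, p, k - 1).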
If a count of X successes has a binomial distribution with n number of trials and p probability of success:
When taking an SRS of size n from a population of size N, we can use a binomial distribution to model the count of success in the sample as long as:
As the number of trials increases, the binomial distribution gets closer to a normal one
A geometric setting arises when we perform independent trials of the same chance process and record the number of trials it takes to get one success. It must pass these conditions:
The variable Y = The number of trials it takes to get a success in a geometric setting
The shape of a geometric distribution is always skewed right
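A Python sketch of geometric probabilities, assuming a success probability of 0.2 chosen only for illustration. The function name mirrors the calculator command:

```python
def geometpdf(p, k):
    """P(Y = k): the first success happens on trial k (k - 1 failures, then a success)."""
    return (1 - p) ** (k - 1) * p

p = 0.2
probs = [geometpdf(p, k) for k in range(1, 6)]

# Approximate the mean by summing k * P(Y = k) over many trials; it should be close to 1/p.
approx_mean = sum(k * geometpdf(p, k) for k in range(1, 2000))
print(probs, approx_mean)
```

Each probability is smaller than the one before it, which is why the distribution is always skewed right.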
If Y is a geometric random variable with probability of success p on each trial:
The sampling distribution of the sample proportion describes the distribution of values taken by the sample proportion in ALL POSSIBLE samples of the same size from the same population.
The sampling distribution of the sample mean describes the distribution of values taken by the sample mean in ALL POSSIBLE samples of the same size from the same population.
The Central Limit Theorem states that when n is large (n ≥ 30 is the usual rule of thumb), the sampling distribution of the sample mean is approximately normal, even if the population distribution is not
Shape of the Sampling Distribution of the Sample Mean x̄:
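A rough simulation sketch of this idea in Python. It assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1), a sample size of 40, and 5000 repeated samples; all of these are arbitrary choices for illustration:

```python
import random
from statistics import mean, stdev

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 40                      # size of each sample
repetitions = 5000          # number of samples drawn

def sample_mean():
    """Mean of one random sample of size n from the skewed population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

sample_means = [sample_mean() for _ in range(repetitions)]

center = mean(sample_means)     # should be close to the population mean, 1
spread = stdev(sample_means)    # should be close to sigma / sqrt(n), about 0.158
print(round(center, 3), round(spread, 3))
```

A histogram of `sample_means` would look roughly symmetric and bell-shaped even though the population itself is skewed.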
The Point Estimator is a statistic that provides an estimate of a population parameter
A Confidence Interval gives an interval of plausible values for a parameter based on sample data
Interpreting a Confidence Interval:
A Confidence Level gives the overall success rate of the method used to calculate the confidence interval
Interpreting a Confidence Level:
A Critical Value is a multiplier that makes the interval wide enough to have the stated capture rate
The margin of error gets smaller when:
When the conditions are met, a C% confidence interval for the unknown proportion p is $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
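A sketch of the one-sample z interval for a proportion (point estimate ± critical value × standard error) in Python. The counts are made up, and the function name is illustrative rather than a calculator command:

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for p: p-hat +/- z* sqrt(p-hat (1 - p-hat) / n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)   # critical value for the chosen level
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)       # margin of error
    return p_hat - margin, p_hat + margin

low, high = one_prop_z_interval(successes=120, n=200)
print(round(low, 4), round(high, 4))
```

This only makes sense when the conditions hold; here the sample has 120 successes and 80 failures, both at least 10.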
These are the conditions we need for estimating p:
To summarize, these are the conditions for constructing a confidence interval about a proportion:
When the standard deviation of a statistic is estimated from data, the result is called the standard error of the statistic
These are the four steps you MUST take when constructing a confidence interval:
We can also construct a confidence interval for an unknown population proportion on our calculator by using Stat > Tests > 1-PropZInt
To determine the sample size n that will give us a C% confidence interval for a population proportion with a maximum margin of error ME, solve the following inequality for n: $z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le ME$
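Solving the margin-of-error condition for n gives n ≥ (z*/ME)² p̂(1 − p̂). The Python sketch below assumes the conservative guess p̂ = 0.5 when no prior estimate is available; the function name is illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    # Always round UP, since rounding down would give a margin slightly too large.
    return ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

n_needed = sample_size_for_proportion(margin=0.03)
print(n_needed)
```

Using 0.5 as the guess gives the largest possible value of p̂(1 − p̂), so the resulting n is safe for any true proportion.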
When estimating the population mean using a sample standard deviation, we use a t-distribution:
There is also a different t distribution for each sample size, specified by its degrees of freedom
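When the conditions are met, a C% confidence interval for the unknown mean μ is $\bar{x} \pm t^* \frac{s_x}{\sqrt{n}}$, where t* is the critical value for the t distribution with n − 1 degrees of freedom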
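A sketch of the one-sample t interval, x̄ ± t* · s/√n, in Python. The standard library has no t distribution, so this version assumes the critical value t* is looked up in a t table and passed in; the data are made up:

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t_interval(data, t_star):
    """Confidence interval for a mean: x-bar +/- t* (s / sqrt(n))."""
    n = len(data)
    x_bar = mean(data)
    s_x = stdev(data)                    # sample standard deviation
    margin = t_star * s_x / sqrt(n)      # critical value times the standard error
    return x_bar - margin, x_bar + margin

data = [2, 3, 4, 5, 6]                   # made-up sample, n = 5 so df = 4
low, high = one_sample_t_interval(data, t_star=2.776)   # 95% critical value for df = 4
print(round(low, 3), round(high, 3))
```

With such a small sample, the interval is only trustworthy if the data show no strong skewness or outliers.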
These are the conditions we need for estimating μ:
Null Hypothesis (Ho): The claim we weigh evidence against in a significance test
Alternative Hypothesis (Ha): The claim that we are trying to find evidence for
The significance level (α) is the value that we use as a boundary for deciding whether an observed result is unlikely to happen by chance alone when the null hypothesis is true
The p-value of a test is the probability of getting evidence for the alternative hypothesis as strong or stronger than the observed evidence when the null hypothesis is true.
This is the formula to use when asked to interpret a p-value for a one-tailed test:
This is the formula to use when asked to interpret a p-value for a two-tailed test:
0.7 in either direction from a random sample of 160 students in Ivy’s school
This must be included in the conclusion for a significance test:
To summarize, here is everything you should include in a significance test:
When drawing conclusions from a significance test, there are two types of mistakes we can make:
These are the four possible outcomes of a significance test:
If Ho is true:
Our conclusion is correct if we do not find convincing evidence that Ha is true
We make a Type I error if we find convincing evidence that Ha is true
If Ha is true:
Our conclusion is correct if we find convincing evidence that Ha is true
We make a Type II error if we do not find convincing evidence that Ha is true
The probability of making a Type I error in a significance test is equal to the significance level
Standardized Test Statistic: Measures how far a sample statistic is from what we would expect if the null hypothesis were true in standard deviation units
These are the conditions for using a standardized test statistic (proportion):
One Proportion Z-Test: To perform a test of $H_0: p = p_0$, compute the standardized test statistic
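A Python sketch of this test, assuming the usual statistic z = (p̂ − p₀) / √(p₀(1 − p₀)/n) with the P-value taken from the standard normal distribution. The counts, the null value, and the function name are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a one-proportion z test of H0: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error uses p0, not p-hat
    if alternative == "greater":
        p_value = 1 - NormalDist().cdf(z)
    elif alternative == "less":
        p_value = NormalDist().cdf(z)
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_prop_z_test(successes=60, n=100, p0=0.5, alternative="greater")
print(round(z, 3), round(p_value, 4))
```

If the P-value is less than the significance level α, reject Ho; otherwise fail to reject Ho.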
Conditions for using the standardized test statistic (mean):
One Sample t Test for a Mean: To perform a test of $H_0: \mu = \mu_0$, compute the standardized test statistic
There is a link between two-sided tests and confidence intervals for a population mean:
The power of a test is the probability that the test will find convincing evidence for Ha when a specific alternative value of the parameter is true
These are some things you can do to increase the power of a significance test:
Sampling Distribution of p̂1 - p̂2: Choose a simple random sample of size n1 from population 1 with proportion of successes p1 and an independent simple random sample of size n2 from population 2 with proportion of successes p2
In a significance test when comparing two proportions, the null hypothesis has this form:
To run a significance test of p1 - p2 = 0, this is the standardized test statistic:
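A Python sketch of the two-proportion z statistic, assuming the pooled (combined) proportion is used in the standard error because the null hypothesis says the two proportions are equal. The counts and the function name are made up:

```python
from math import sqrt

def two_prop_z_stat(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0 using the pooled sample proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)     # combined proportion of successes in both samples
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

z_two_prop = two_prop_z_stat(x1=60, n1=100, x2=50, n2=100)
print(round(z_two_prop, 4))
```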
Sampling Distribution of x̅1 - x̅2: Choose a simple random sample of size n1 from population 1 with mean μ1 and standard deviation σ1 and an independent simple random sample of size n2 from population 2 with mean μ2 and standard deviation σ2
In a significance test when comparing two means, the null hypothesis has this form:
To run a significance test of μ1 - μ2 = 0, this is the standardized test statistic:
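A Python sketch of the two-sample t statistic, assuming the unpooled form t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂). The data and the function name are made up, and the P-value would still come from a t table or calculator:

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t_stat(sample1, sample2):
    """t statistic for H0: mu1 - mu2 = 0 (unpooled standard error)."""
    n1, n2 = len(sample1), len(sample2)
    se = sqrt(variance(sample1) / n1 + variance(sample2) / n2)   # sample variances, n - 1 divisor
    return (mean(sample1) - mean(sample2)) / se

group1 = [2, 3, 4, 5, 6]    # made-up data
group2 = [4, 5, 6, 7, 8]
t_two_sample = two_sample_t_stat(group1, group2)
print(t_two_sample)
```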
Source: https:%%//%%apcentral.collegeboard.org/pdf/ap-statistics-course-and-exam-description.pdf