We're starting our formal pass through Chapter 3 on Friday, continuing on Monday and Wednesday, and testing on Thursday. Since we've already covered this material (August and September), we should be able to master these concepts quickly.
This chapter is all about the relationship between explanatory and response variables. Even when there is a relationship between the variables, we cannot determine whether one causes changes in the other without an experiment.
The most common way to depict bivariate data (x,y pairs) is through a scatterplot. When we describe a scatterplot we will include descriptions of the form of the data, the direction, the strength, and any unusual observations. Categorical elements can be added by using special markers or colors in the scatterplot.
Correlation measures the strength and direction of a linear association. Positive associations will have positive correlation coefficients. Negative associations will have negative correlation coefficients. The correlation coefficient can be thought of as an average product of the z-scores for the x and y components of all the points, dividing by n - 1 rather than n (if that is any help). Your calculator will compute the correlation for you if you turn Diagnostics On. Some valuable characteristics of correlation are posted on page 191 of the book. Examples of graphs with a variety of correlation coefficients are posted on page 192. These may be useful if you are a visual learner. Please pay attention to the four cautions on pages 192 and 193.
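If it helps to see the "average product of z-scores" idea worked out, here is a minimal Python sketch. The data are made up for illustration (they are not from class or the book), and the function name `correlation` is just a label chosen here.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the 'average' product of z-scores (dividing by n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # z-score of each x times z-score of the matching y
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))
```

The result should match what your calculator reports for r on the same lists.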
The least squares regression line is a line of best fit that minimizes the sum of the squared vertical distances between the observed points and the regression line. In statistics we usually use the formula y-hat = a + bx for this line, where y-hat is the predicted value of y for a particular value of x.
The slope, b, is interpreted as the predicted change in y, on average, for each additional unit of increase in x. Of course, we would cram as much context as we were provided into the interpretation, for instance:
We expect the sales price of the house to increase by $0.55 for every additional dollar spent on the new kitchen.
The slope is equal to the correlation coefficient times the ratio of the standard deviation of y to the standard deviation of x:
b = r * (sy / sx)
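Here is a short Python sketch of the slope formula in action, using made-up data. It also finds the intercept from the fact that the least squares line always passes through the point (x-bar, y-bar), so a = y-bar - b * x-bar. The function name `least_squares_line` is a hypothetical label, not something from the book or the calculator.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r * sy / sx."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # line passes through (x-bar, y-bar)
    return a, b

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

For these numbers the line works out to roughly y-hat = 2.2 + 0.6x, which you can confirm with LinReg(a+bx) on your calculator.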
Extrapolation is the term used when you use the prediction formula with values of x that are outside the range of x values used to build the model, for instance using a model that predicts a child's height based on age with adult ages.
Residuals are the differences between the observed values of y and the predicted values of y (observed minus predicted). Their size tells us how closely the line fits the data AND their pattern tells us whether the linear model is appropriate.
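A quick sketch of computing residuals in Python, again with made-up data. The line y-hat = 2.2 + 0.6x used here is the least squares line for these particular made-up points; it is an illustration, not a result from class.

```python
# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least squares line for this made-up data: y-hat = 2.2 + 0.6x
a, b = 2.2, 0.6

predicted = [a + b * x for x in xs]
# residual = observed y minus predicted y
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  observed={y}  predicted={y_hat:.1f}  residual={e:+.1f}")
```

For a least squares line the residuals always add up to zero (apart from rounding), so the thing to look at is their pattern when plotted against x: a random scatter supports the linear model, while a curve suggests it is not appropriate.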
The coefficient of determination, r^2, is the square of the correlation coefficient. It measures the fraction of the variability in the y values that is accounted for by the least squares line: how much of the squared prediction error we would make by using the mean of y alone is eliminated by using the least squares predictions when x is known. It can be thought of as the effectiveness of the x-value in predicting its y-value.
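To make the "compared with using the mean" idea concrete, here is a small Python sketch with made-up data. It compares the squared error from predicting with the mean of y (called `sst` here) against the squared error from predicting with the least squares line (`sse`). Those variable names are labels chosen for this sketch.

```python
from statistics import mean

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)

# Least squares slope and intercept (algebraically the same as b = r * sy / sx)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Squared error if we always predict the mean of y
sst = sum((y - y_bar) ** 2 for y in ys)
# Squared error if we predict with the least squares line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / sst
print(f"error using mean: {sst:.2f}, error using line: {sse:.2f}")
print(f"r^2 = {r_squared:.2f}")
```

Here the line removes about 60% of the squared error we would have had using the mean alone, so r^2 is about 0.60.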
Section 3.3 is FULL of theoretical and helpful philosophical concepts that will help you get the big picture.
Problems from the book due Wednesday, November 9:
3.5, 3.10, 3.13, 3.17, 3.20, 3.24, 3.29, 3.33, 3.35, 3.37, 3.44, 3.47, 3.55, 3.65, 3.85, 3.86
Please note that most of these problems are odd, so their answers are in the back of the textbook. You can check your work as you go along.
Friday, November 4, 2011
We worked through a complete regression problem, from data collection to the LinReg T-test, predicting the weight of a Fun-size bag of Skittles from the number of candies inside.
Monday, November 7, 2011
We went old school today. We measured the diameters and circumferences of a set of balls to find the relationship between the two variables. Our empirical values of the slope (an estimate for pi) ran from about 2.5 to 3.9. We identified all the steps required for a complete answer to the problem and revisited the LinReg T-test.
The linear regression t-test tells us how likely it would be to get a sample slope at least as far from zero as ours if there were really no linear relationship between x and y. The test statistic is the ratio of the slope to the standard error of the slope. If the standard error of the slope is larger than the slope itself, then the slope is less than one standard error from zero, and the data are consistent with there being no relationship between the two variables.
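The ratio described above can be sketched in a few lines of Python. The data are made up, and the function name `slope_t_statistic` is hypothetical. The standard error of the slope is computed as s / sqrt(sum of (x - x-bar)^2), where s is the standard deviation of the residuals using n - 2 degrees of freedom. This sketch stops at the t-statistic; the p-value would still come from LinRegTTest on the calculator or a t table.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(xs, ys):
    """Return (slope, SE of slope, t-statistic, degrees of freedom)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx                 # least squares slope
    a = y_bar - b * x_bar           # least squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))         # standard deviation of the residuals
    se_b = s / sqrt(s_xx)           # standard error of the slope
    t = b / se_b                    # test statistic for H0: true slope = 0
    return b, se_b, t, n - 2

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, se_b, t, df = slope_t_statistic(xs, ys)
print(f"slope = {b:.3f}, SE = {se_b:.3f}, t = {t:.3f}, df = {df}")
```

The same relationship, t = slope / SE of slope, is what lets you fill in a missing number in a partially completed computer output table.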
Interpreting the p-value of the LinReg T-test:
If the p-value is very small (less than 5%), then it is unlikely that we would get a slope as "strong" as what we got if there is not really a relationship between the variables. This represents good evidence that there is a real linear relationship between x and y, not just an accidental one.
If the p-value is not very small (greater than 5%), then we do not have evidence that casts doubt on the null hypothesis that the true slope is zero. We cannot be sure whether there is or is not a relationship between the variables.
You will have to perform and interpret the LinReg T-test on Thursday's test.
You will also have to interpret output from a computer program for regression. There was a homework problem that asked you to interpret the output. There are a lot of exam problems that ask you to do the same. Today I handed out three more examples in class.
Here's another example that might help you understand what to look for when interpreting the output.
http://www.jerrydallal.com/LHSP/slrout.htm
We are most interested in the values of the y-intercept and slope, the standard error of the slope, the t-statistic, and the p-value.
If you were given a partially-filled out table, could you find the rest of the numbers?
Wednesday, November 9, 2011
Today we reviewed for the test. A copy of some of our computer output is linked here:
Download EXAMPLES FROM CLASS 11-9
Please review the elements from our first test that applied to linear regression. It would be a shame to miss the same questions AGAIN!
There are some questions that mirror the questions from a recent AP exam near the bottom of the document above. Make sure that you can answer these questions.