In this video, we'll go through an example of regression modeling in a new context. The context is whether there's a relationship between a child's language development and the amount of time the child spends in preschool. We have data from a random sample of 34 preschoolers: the number of hours each one spends at preschool each week and their score on a language development test out of 100 points. Here is the output from the statistical software called R. For this data, we see that the intercept term b0 is 76.2735 and the slope term b1 is 0.6007. We also see, for example, the t statistic if we wanted to test whether beta one is equal to zero, the R squared value, and all sorts of other information about our data that's been computed by the software in this analysis. What role is the variable language score playing? Oh, I skipped over this. Here's the scatter plot of the data along with the regression output. We see that language score is our y variable and hours spent at preschool is our x variable. We see that, in the data set, x is between about 4 and 30, it looks like, and y is between about 70 and 100. Based on the data, there does seem to be a linear relationship between x and y. So, what role is language score playing? What's the correlation between preschool hours and language score? And how much would you expect language score to increase, on average, for each one-hour increase in time spent at preschool? Language score is the response variable: remember, language score was the y variable in our data, and that's what we call the response variable. Computing the correlation here requires us to remember the relationship between the correlation coefficient r and the coefficient of determination, R squared: R squared is just the square of the correlation coefficient. If you go back to the output, it tells us on the second-to-bottom line that the so-called multiple R squared value is 0.3541.
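The step from R squared back to the correlation can be sketched as a quick calculation. This is just an illustration in Python of the arithmetic the lecture does by hand; the value 0.3541 is the "Multiple R-squared" reported in the R output.

```python
import math

# R^2 ("Multiple R-squared") reported by the regression output
r_squared = 0.3541

# The correlation coefficient is the square root of R^2. For simple linear
# regression, r takes the sign of the slope, which is positive here.
r = math.sqrt(r_squared)
print(round(r, 4))  # 0.5951
```

Note the sign caveat: taking the square root always gives a positive number, so you have to check the slope to know whether the correlation is positive or negative.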
This 0.3541 is the correlation squared. If I want the correlation itself, I take the square root: the square root of 0.3541 turns out to be 0.5951. Then finally, how much would you expect language score to increase on average for each one-hour increase in time spent at preschool? Well, that's what the slope is measuring. The slope of the regression line from our data was 0.6007. Then, to give the equation of the least squares regression line, we go back to our output. The intercept estimate is 76.2735 and the slope estimate is 0.6007, so this is pretty straightforward: y hat equals 76.2735 plus 0.6007 times x. Then we're asked to compute a residual. This is for a preschooler who attends 15 hours each week and scored 75 on the language development test. They have an x value of 15 and a y value of 75. To compute the residual, we're going to plug x equals 15 into the equation and compute the y hat value, and then we'll take 75, the actual y value, minus y hat. Here we go: plug 15 into the regression equation and get 85.284. The actual y value was 75, and 75 minus 85.284 is negative 10.284. That's computing the residual. Computing a residual always boils down to plugging the x value into the regression equation, computing the y hat value, the predicted y value, and then subtracting that from the observed y. What about hypothesis testing? We want to test whether there's a linear relationship between preschool hours and language score, and our hypotheses are the same as always: beta one equals zero versus beta one not equal to zero. The test statistic and p value came from the output. Let me scroll back up to the output. You'll see in the row starting with the word hours that we have a test statistic of 4.188 and a p value of 0.000206. That's where I got those values from. In terms of our table of evidence, that would be extremely strong evidence in favor of a linear relationship.
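The prediction-then-residual recipe described above can be sketched as a small Python calculation. The intercept and slope are the estimates from the regression output; everything else is just the arithmetic from the lecture.

```python
# Coefficients from the regression output: intercept b0 and slope b1
b0 = 76.2735
b1 = 0.6007

def predict(x):
    """Predicted language score (y hat) for x hours of preschool per week."""
    return b0 + b1 * x

# Residual for the preschooler with x = 15 hours and observed score y = 75:
# residual = observed y minus predicted y hat
y_hat = predict(15)      # 76.2735 + 0.6007 * 15 = 85.284
residual = 75 - y_hat    # 75 - 85.284 = -10.284
print(round(y_hat, 3), round(residual, 3))
```

A negative residual like this one means the child scored below what the regression line predicts for their amount of preschool time.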
Now, why would it not be appropriate to use the least squares regression line to estimate the language score for a child who spends over 40 hours a week in preschool? Well, remember, we looked at the data and noticed that the largest value was about 30 hours per week, so this would be extrapolating well beyond the range of the data. Then what about the conditions for inference? The conditions for inference include linearity, constant variance, and normality of the residuals. All of those seem to be met here. The histogram is a little bit odd, but it's a relatively small data set, and histograms can be not so revealing when you have just a small amount of data. The QQ plot looks good, so I'm not too worried about the normality assumption, and I'm not too worried about the linearity assumption or the constant variance assumption. Independence is something these plots don't really help with, but we're told that the data came from a random sample, which gives me some confidence that we have independence.
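The extrapolation warning above can be made concrete with a hypothetical guard in code: refuse to predict outside the observed range of x. The range bounds (about 4 to 30 hours) and the helper name `predict_safe` are assumptions for illustration, not part of the lecture.

```python
# Observed range of x in this data set, roughly 4 to 30 hours per week
X_MIN, X_MAX = 4, 30

def predict_safe(x, b0=76.2735, b1=0.6007):
    """Predict a language score, but refuse to extrapolate outside the data.

    b0 and b1 are the intercept and slope estimates from the regression output.
    """
    if not (X_MIN <= x <= X_MAX):
        raise ValueError(
            f"x = {x} is outside the observed range [{X_MIN}, {X_MAX}]; "
            "predicting here would be extrapolation"
        )
    return b0 + b1 * x

print(round(predict_safe(15), 3))  # inside the observed range, so this is fine
# predict_safe(40) would raise, since 40 hours is well beyond the data
```

The point is that the fitted line is only trustworthy where we actually have data; nothing in the model prevents you from plugging in x = 40, so the discipline has to come from the analyst.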

LangDevelopment

From Vincent Melfi November 25th, 2023  
