
Linear Statistical Models Discussion


Write a 1-page discussion.

After this, write 2 replies to other students' discussions, about 100 words each.

PROMPT:

Research a linear model, then check the results using [we need specifics here].

Lesson

In this final module of new material, we will apply much of what we have learned so far to formulate a statistical model. Regression models are a form of supervised learning: we have a means of guiding the development of a model by examining how well it predicts the observed data. Other techniques, such as clustering or principal components analysis, have no mechanism for testing the accuracy of their predictions; they rely on the "reasonableness" of the results they arrive at.

The regression model's general form is: dependent variable (response) = f(independent variables, also called predictors or features),

where f is a linear function. The regression model is linear in the sense that the coefficients don't appear in any form except as a constant or as a multiplier of an independent variable. Also, we are going to restrict our attention to dependent variables whose values are not confined to particular categories (for example, buy a product or not buy a product). The independent variables themselves can take many forms. A common technique in economics is to use so-called dummy variables, or indicator variables, that take the value one if a condition holds for that observation and zero otherwise. For example, if we were predicting student GRE scores, we might include a dummy variable for whether the student attended a private university. The coefficient on this variable is interpreted as the difference in GRE scores associated with attending a private university, relative to the baseline of not attending one. We can also transform the independent variables, which allows us to perform polynomial regression. In fact, any transformation of the independent variables is permissible, so long as the coefficients still enter the model only as constants or multipliers; a minimal sketch follows.
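To make the dummy-variable and transformation ideas concrete, here is a minimal sketch using NumPy's least-squares solver. The variable names (hours, private, gre) and the data are hypothetical, invented purely for illustration.

```python
import numpy as np

# Hypothetical data: hours studied, a private-university dummy, GRE scores.
hours = np.array([10, 15, 20, 25, 30, 35, 40, 45], dtype=float)
private = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)  # 1 = private university
gre = np.array([300, 315, 310, 330, 335, 325, 345, 335], dtype=float)

# Design matrix: intercept, hours, hours squared (a polynomial term), dummy.
# The model is still linear: each column enters only as coefficient * column.
X = np.column_stack([np.ones_like(hours), hours, hours**2, private])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, gre, rcond=None)
print(beta)  # [intercept, b_hours, b_hours_squared, b_private]
```

Note that the squared term makes the fitted curve nonlinear in hours, yet the model remains linear in the coefficients, which is all that "linear regression" requires.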

There are a series of questions we will want to ask about our model:

1) How strongly do the independent variables, as a set, relate to the dependent variable? This also raises the question of whether we have the right form for the regression; perhaps linearity is not the right assumption.

2) How accurate is our model? In other words, how well does it predict the dependent variable within our data set? Again, this is a test of the regression model as a whole.

3) How strong is the relationship of each individual independent variable with the dependent variable? Although we begin our analysis with an ensemble of potential predictors, some of them may relate more strongly than others. Further, when we build models we like to have Ockham's Razor in our backpacks: we want an accurate forecast, but with as few variables as possible. As we examine the individual variables, we may find that they need to be transformed in some way.

4) How well do we predict outside of our training data? We may want to divide our data into two parts, a training set and a test set: develop the model on the training set, then use the test set to see how well we predict observations the model has not seen (see the sketch after this list).

5) Are some observations somehow different from the others? Observations whose prediction errors (residuals) are significantly larger than those of the other observations are known as outliers. Is there a problem with the data collection for such an observation, or is there some substantive difference between it and the others?
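A minimal sketch of point 4 (with the residual check from point 5), continuing with the hypothetical X and gre arrays from the previous sketch; the 6/2 split is an arbitrary choice for illustration.

```python
import numpy as np

# Continuing the sketch above: hold out 2 of the 8 rows as a test set.
rng = np.random.default_rng(0)
idx = rng.permutation(len(gre))
train, test = idx[:6], idx[6:]

# Fit on the training rows only, then predict the held-out rows.
beta, *_ = np.linalg.lstsq(X[train], gre[train], rcond=None)
resid = gre[test] - X[test] @ beta

# Unusually large residuals flag candidate outliers (question 5).
print(resid)
```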

We will discuss several statistics that are computed in regression analysis. In simple linear regression, we can compute what is formally the Pearson product-moment correlation coefficient, known more briefly as the correlation coefficient, r. It measures the strength of a linear relationship between the dependent and independent variables, and it varies between -1 and 1. We must be careful not to treat the correlation coefficient as a metric of any relationship between the variables, because the relationship could be nonlinear. As an example, if you calculate the correlation coefficient between y and x for the parabola y = x², with x values symmetric about zero, you will find that it is zero, yet the two variables are clearly related. The square of the correlation coefficient, r², measures the percent of the variation in y "explained" by the variation in x. r² varies between 0 and 1, with zero meaning x has no capability to explain y's variation and 1 meaning x explains all of it. We can apply a t-test to assess the significance of the independent variable; with a single predictor, this simultaneously tests the regression as a whole.
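The parabola example is easy to verify numerically. Note that r comes out to (essentially) zero here because the x values are symmetric about zero; an asymmetric range of x would give a nonzero r.

```python
import numpy as np

x = np.linspace(-3, 3, 101)  # x values symmetric about zero
y = x**2                     # a perfect, but nonlinear, relationship

r = np.corrcoef(x, y)[0, 1]
print(r)      # ~0: no linear relationship, despite total dependence
print(r**2)   # r squared is ~0 as well
```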

We then discuss multiple regression; multiple because there is more than a single predictor. The simple correlation coefficient is no longer meaningful when there is more than one independent variable. We can, however, estimate our regression model and then correlate the predicted and actual values of the response variable; that correlation measures the linear relationship between the response and the set of predictors. It is called the multiple correlation coefficient, R (capitalized, presumably, to distinguish it from the simple correlation coefficient). Unlike r, it varies only between 0 and 1, so it does not indicate the direction of the relationship. Its square, the coefficient of determination R², like r², measures the percentage of variation in the response variable explained by the predictor variables. Commonly the coefficient of determination is adjusted for the number of variables; that adjustment prevents it from increasing merely because an additional variable has been added.
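R can be computed exactly as described, by correlating fitted and actual values; the adjustment below is the standard adjusted R² formula, with n observations and p predictors. This continues the first hypothetical sketch.

```python
import numpy as np

# Refit on all rows, then correlate fitted and actual responses.
beta, *_ = np.linalg.lstsq(X, gre, rcond=None)
y_hat = X @ beta
R = np.corrcoef(gre, y_hat)[0, 1]   # multiple correlation coefficient
R2 = R**2                           # coefficient of determination

# Standard adjustment: n observations, p predictors (intercept excluded).
n, p = X.shape[0], X.shape[1] - 1
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
print(R2, adj_R2)
```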

The overall regression is tested using the F-statistic, which comes from treating the regression model as an ANOVA analysis. The individual coefficients are tested using a t-test, where the t-statistic is the coefficient value divided by its standard error.
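Both tests are reported by any regression package. Here is a sketch using statsmodels (assuming it is installed), applied to the same hypothetical design matrix from the earlier sketches.

```python
import statsmodels.api as sm

# X from the first sketch already contains an intercept column,
# so it can be passed to OLS directly.
model = sm.OLS(gre, X).fit()

print(model.fvalue, model.f_pvalue)  # F-test of the regression as a whole
print(model.tvalues)                 # per-coefficient t = estimate / std. error
print(model.summary())               # full table, including the ANOVA-based F
```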
