Tutorial 6:

Regression Modelling


In this session you will learn about General Linear Models. Remember T-tests and ANOVAs? They are actually an application of Linear Regressions. In this session you will learn about the following techniques:

These techniques allow you to explore relationships between variables. Regressions are considered predictive tools which use statistical models. These models allow you to test your hypothesis and explore how different factors (independent variables) affect your outcome (the dependent variable).

Additional Learning Materials

Dataset & Variables

For the examples in this tutorial we will be using the following dataset:

Skoczylis, Joshua, 2021, "Extremism, Life Experiences and the Internet", https://doi.org/10.7910/DVN/ICTI8T, Harvard Dataverse, Version 3.

Linear Regressions - Variables required

Your dependent variable must be continuous. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.

Binomial Logistic Regression - Variables required

Your dependent variable must be binomial (two levels, e.g. Male/Female). Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression. 

Multinomial Logistic Regression - Variables required

Your dependent variable must have three or more levels (e.g. Green/Red/Blue). These levels can be nominal or ordinal. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.

Ordinal Logistic Regression - Variables required

Your dependent variable must be an ordinal variable. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression. 

Concept: Linear Regressions

Okay, let's start with the basics. What hypothesis does a linear regression test?

Linear regression allows you to model the relationship between one dependent variable and one or more independent variable(s). If you have more than one independent variable we call the process multiple linear regression (multi-linear regression). Note, in a simple linear regression with a nominal or ordinal independent variable, you are essentially running a T-Test or ANOVA, but you do get some additional information such as the RSquared.

Note: Most of the time the relationship between the dependent and independent variable(s) is linear. However, the 'linear' in linear regression actually refers to the linearity of the equation (in its coefficients), and not the shape of the line. This means you can have a curvilinear relationship in a linear regression (if you are interested you can read more in Frost's book 'Regression Analysis' pp. 93-110).

Below are some videos that explain linear and multi-linear regression in some detail. The videos do make references to some basic equations, but you should be able to follow and understand the concepts without understanding the equations mentioned. For a simpler overview you can watch this video clip here, but we highly recommend watching the more extensive ones below.

It is helpful here to briefly discuss the equation of the line. The formula is:

Y = a + bX + error

Y is your dependent variable, X is the independent variable, b is the slope of the line, and a is the intercept (the value of Y where the line crosses the y-axis, i.e. when X is 0). Error is the difference between the observed datapoints and the predicted values.

In short, as the value of the independent variable goes up or down, the predicted value of the dependent variable changes by b for every one-unit change: it increases if b is positive and decreases if b is negative.
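To make the equation of the line concrete, here is a minimal sketch of how a and b can be estimated by ordinary least squares. The function name fit_line and the data (hours, score) are made up for illustration and do not come from the tutorial dataset; your stats programme does this for you.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Hypothetical illustration data (not from the tutorial dataset).
hours = [1, 2, 3, 4, 5]
score = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(hours, score)

# The error term: observed value minus predicted value for each datapoint.
errors = [yi - (a + b * xi) for xi, yi in zip(hours, score)]

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print("errors:", [round(e, 2) for e in errors])
```

With an intercept in the model the errors always sum to (approximately) zero, which is a quick sanity check on any fitted line.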

As mentioned, linear regressions are not limited to one independent variable. You can have multiple independent variables. The video below explores this in more detail.

Note, results for linear regressions will always display the intercept. You can usually ignore this value when reporting your results. 

It is worth noting that your results are variable dependent. This means that if you remove or add variable(s), your results will change.

Linear Regression: Interpreting your Output

If you want to learn more about how the line is fitted, watch this Stats Quest video.

It is worth quickly recapping what RSquared and adjusted RSquared are. In brief, RSquared is the proportion of the variation in the dependent variable that is predictable from the independent variable(s). RSquared is a number from 0 to 1, so 0 means 0% of the variation is explained while 1 means 100% is explained. Adjusted RSquared is read in the same way, although it can fall slightly below 0 for a very poor model.

When using more than one independent variable, always use adjusted RSquared rather than RSquared. RSquared never goes down when you add another independent variable, even one that is unrelated to the dependent variable, so it can give you an inflated view of your model. The adjusted RSquared corrects for this by taking into account how many independent variables you have added to the model. A very easy to follow video explaining RSquared is available here.
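The difference between RSquared and adjusted RSquared can be sketched as follows. This uses the standard textbook formulas; the function names and the example numbers (an RSquared of 0.20 from 50 observations) are assumptions chosen purely for illustration.

```python
def r_squared(observed, predicted):
    """Proportion of the variation in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise RSquared for the number of independent variables in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical example: the same RSquared reached with 1 or with 5 predictors.
r2 = 0.20
adj_one = adjusted_r_squared(r2, n_obs=50, n_predictors=1)
adj_five = adjusted_r_squared(r2, n_obs=50, n_predictors=5)

print(f"RSquared = {r2:.3f}")
print(f"adjusted (1 predictor)  = {adj_one:.3f}")
print(f"adjusted (5 predictors) = {adj_five:.3f}")
```

The more predictors you need to reach the same RSquared, the lower the adjusted value, which is why it is the fairer figure to report for models with several independent variables.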

Interpreting RSquared:

Estimate Coefficient:

The other value that you get when running a regression is the Estimate Coefficient. This represents the relationship between the dependent variable and the independent variable. 

The Estimate Coefficient represents the mean change in the dependent variable given a one unit change in the independent variable, holding any other independent variables in the model constant. This means that if your Estimate is +0.5 your dependent variable will increase by 0.5 for each unit increase of the independent variable.

As noted above, nominal/ordinal variables are represented using dummy variables. A dummy variable transforms nominal/ordinal data into a series of variables that take the value 0 or 1, where 0 indicates the absence and 1 the presence of something (e.g. 0 may indicate 'over 65' and 1 'under 65'). For more details on how dummy variables function follow this link. There are, surprise surprise, other ways of dealing with nominal variables. These are not covered here, but you can read more about them here.

Note, you can treat ordinal data as continuous. However, this means that you are removing its constraints and treating the variable as if the distance between each step is equal, which is usually not the case. Poor quality data means poor quality results.

The coefficients of dummy variables measure the average difference between the group coded with the value 1 and the group coded with the value 0 (the default or reference group). Most stats programmes should allow you to set your reference group easily. 
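The following sketch illustrates dummy coding and why the coefficient of a dummy variable equals the difference between group means. The helper names (dummy_code, fit_line) and the six scores are hypothetical and are not taken from the tutorial dataset.

```python
def dummy_code(values, reference):
    """Create one 0/1 variable per level, leaving out the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b


# Hypothetical data: Male is the reference group (coded 0).
gender = ["Male", "Female", "Female", "Male", "Female", "Male"]
score = [3.0, 2.0, 2.5, 2.5, 3.0, 3.5]

female = dummy_code(gender, reference="Male")["Female"]
intercept, estimate = fit_line(female, score)

# Group means, computed directly for comparison.
mean_male = sum(s for g, s in zip(gender, score) if g == "Male") / gender.count("Male")
mean_female = sum(s for g, s in zip(gender, score) if g == "Female") / gender.count("Female")

print("dummy variable:", female)
print(f"intercept = {intercept:.2f} (mean of the reference group)")
print(f"estimate  = {estimate:.2f} (Female mean minus Male mean)")
```

A variable with more than two levels would produce one dummy per non-reference level, and each coefficient would then compare that level with the reference group.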

Standard Error of the Regression

Regression output will also display the Standard Error of the Regression (SE). The SE represents the typical distance of the observed values from the regression line. It gives us an indication of how wrong our regression model is on average. The smaller the SE, the better. Remember that the SE is measured in the units of the dependent variable. If the residuals are roughly normally distributed, you would expect about 95% of all observations to fall within +/- 2 SE of the regression line. In some ways, if you are trying to predict outcomes, looking at the SE can give a more accurate picture of how good the predictions are.
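As a rough sketch, the Standard Error of the Regression can be computed from the residuals using the usual formula (square root of the residual sum of squares divided by the residual degrees of freedom). The observed and predicted values below are invented for illustration, and the function name is an assumption.

```python
import math


def regression_standard_error(observed, predicted, n_predictors):
    """Typical distance between observed values and the regression line."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Degrees of freedom: observations minus predictors minus the intercept.
    dof = len(observed) - n_predictors - 1
    return math.sqrt(ss_res / dof)


# Hypothetical observed values and model predictions (one predictor).
observed = [2.5, 3.5, 5.0, 5.0, 7.0, 8.0]
predicted = [2.0, 4.0, 4.0, 6.0, 7.0, 8.0]

se = regression_standard_error(observed, predicted, n_predictors=1)

# Rough 95% band around a prediction of 6.0, assuming normal residuals.
lower, upper = 6.0 - 2 * se, 6.0 + 2 * se

print(f"SE of the regression = {se:.3f}")
print(f"approximate 95% band around 6.0: {lower:.2f} to {upper:.2f}")
```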

Let's take a look at an example.

Let's quickly try and look at the results of the regression above. Remember, we have not checked our assumptions yet, so the results are only preliminary.  

As we can see, the p-value shows that the model is significant (<.001), but the Adjusted RSquared is not great at 0.162.

As you can see each of the predictor variables (independent variables) has a significant p-value. If you have variables that are not significant, consider removing them from your model, but be aware that this will change all of the results. It is worth pointing out, that sometimes it is also worth reporting values when they are not significant, especially if you expected them to be so. 

Let's look at two of the variables: Gender and Confidence in Self.


The reference level here is Male. The difference between men and women is significant with a p-value of <.001. Looking at the Estimate, we can see that the mean difference in the extremism score for men and women is -0.495. This means women have a lower extremism score. Looking at the Confidence Intervals we can be 95% sure that the difference lies somewhere between -0.642 and -0.347. The SE is small, which is good. 

Confidence in Self:

Again the p-value (0.006) indicates that this relationship is significant. The Estimate tells us that for each one-unit increase in Confidence in Self the Extremism score drops by 0.100. Looking at the Confidence Intervals we can be 95% sure that the change lies somewhere between -0.171 and -0.029. Again the SE is pretty small.

The plot below visualises the relationship between Gender, Confidence in Self and the Extremism score. As you can see the Extremism score for men is higher than that for women. We can also see a downward trend in the Extremism score as self-confidence increases, for both men and women.

Overall though, given the Adjusted RSquare value this is not a great model. 

Linear Regression: Assumptions

Before we accept the results of any linear regression we need to make sure that a series of assumptions is met.

The assumptions are:

A Scatterplot should give you a good idea if your data has a linear or curvilinear relationship.

You can check the two assumptions above by looking at a Q-Q plot and residual plots. There are also tests for normality - if you get a p-value of <.05 then the assumption is violated.

To check that this assumption is met you use the VIF (Variance Inflation Factor) and the Tolerance. The VIF should be as low as possible. It should be no more than 4 or 5. The Tolerance should be above .20. 
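The VIF and Tolerance are two views of the same number: the VIF is 1 / (1 - RSquared), where RSquared comes from regressing one independent variable on all the others, and the Tolerance is 1 / VIF. The sketch below uses a simplification that only holds for a model with exactly two predictors, where that RSquared is just their squared correlation. The data and function names are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF and Tolerance for a model with exactly two predictors."""
    r2 = pearson_r(x1, x2) ** 2   # RSquared of one predictor regressed on the other
    tolerance = 1 - r2
    return 1 / tolerance, tolerance


# Hypothetical predictor values.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif, tolerance = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.2f}, Tolerance = {tolerance:.2f}")
```

Note that a Tolerance of .20 corresponds to a VIF of 5, so the two cut-offs mentioned above express the same limit.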

A quick note on interpreting Residual Plots and Q-Q plots.

Residual Plots:

When you check your residual plots, you need to make sure that there are no patterns. If there are, then you will need to use alternative methods to analyse your data. Sometimes, it just indicates that the relationship is curvilinear.

Q-Q Plots:

You can also use the Q-Q plot to check the normality assumption. Essentially, you need to ensure that the datapoints follow the line. In the below example you can see that the normality assumption is not met as the datapoints are nowhere near the line. For a detailed explanation watch this video on quantiles and this video for interpreting Q-Q plots.
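For readers curious about what a Q-Q plot actually draws, here is a minimal sketch that computes the points without plotting them. It pairs each sorted, standardised residual with the value a normal distribution would put at the same position. The plotting position (i - 0.5) / n is one common convention, so your stats programme may use a slightly different one, and the residuals are invented for illustration.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical quantile, observed standardised residual) pairs."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


# Hypothetical residuals with one large positive outlier.
residuals = [-1.2, -0.4, 0.1, 0.3, 0.5, 4.0]
points = qq_points(residuals)

for theoretical, observed in points:
    print(f"expected {theoretical:6.2f}   observed {observed:6.2f}")
```

If the residuals were normally distributed the two columns would be close to each other, which is what "following the line" means on the plot.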

As you can see from the above plots and tables, most of the assumptions are violated. The Q-Q plot and the residual plots clearly indicate that the normality assumptions are not met. In terms of Collinearity, however, the values are within the expected range, so collinearity is not an issue for this model.

The most important assumption of Normality is not met. So we need to consider other options. 

Below is a short video clip that goes through how you check your assumptions. Most Stats programmes should make it easy to generate the right plots and run the appropriate test to check the assumptions.

If Assumptions are not met

If the assumptions are not met consider the following:

 If the assumptions are still not met, consider transforming your dependent variable into an ordinal/nominal variable. You can then use this new variable in Logistic Regression. 

Finally, you can use a Quantile or Robust Regression. These are, however, not covered in this tutorial.

Concept: Logistic Regression

So far we have only focused on continuous dependent variables. Logistic regression allows us to use nominal and ordinal dependent variables. Logistic regression is another extension of the linear regression discussed above. There are three types of Logistic Regression:

It is worth watching the below videos which provide you with an in-depth introduction to the concept of Logistic Regression. For a shorter and less extensive introduction watch this clip here.

It is also worth watching 'Logistic Regression understanding its Coefficients' and 'Logistic Regressions RSquared', which will help you learn how to interpret the outcomes of a Logistic Regression.

Multinomial & Ordinal Logistic Regression:

These are extensions of the Binomial Logistic Regression. Your dependent variable must have more than two categories. Multinomial Logistic Regression is often considered an attractive analysis because it does not assume normality, linearity, or homoscedasticity.

Note, you can use a Multinomial Regression for ordinal variables, but you should never use an Ordinal Logistic Regression for nominal variables.

For an Ordinal Logistic Regression, each independent variable should have the same effect on the odds of moving to a higher-order category everywhere along the scale (the proportional odds assumption). This means that the effect of a predictor on the odds of moving from Like very much to Like a lot should be the same as its effect on the odds of moving from Dislike to Dislike a lot. This is a condition that is rarely met in reality. If the condition is not met, use a Multinomial Regression. For a more detailed description follow this link.

Logistic Regressions: Understand Odds and Odds Ratios

To make sense of the outputs of Logistic regressions you need to understand the concepts of Odds and Odds Ratios.

Odds measure the likelihood of a particular outcome occurring. They are calculated as the ratio of the number of events that produce a certain outcome to the number of events that do not. So for a six-sided dice the odds of rolling a 6 are 1 to 5 (1 way of getting the outcome v 5 ways of not getting it), or 0.2. Note that this is not the same as the probability of rolling a 6, which is 1 in 6, or roughly 16.7%.
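The link between probability and odds can be sketched in a few lines. For a fair six-sided dice the probability of rolling a 6 is 1/6, while the odds are 1 to 5. The function names are illustrative only.

```python
def probability_to_odds(probability):
    """Odds = chance of the outcome divided by chance of no outcome."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Convert odds back into a probability between 0 and 1."""
    return odds / (1 + odds)


p_six = 1 / 6                          # one face out of six
odds_six = probability_to_odds(p_six)  # one way to succeed v five ways to fail

print(f"probability of a 6 = {p_six:.3f}")
print(f"odds of a 6        = {odds_six:.3f}")
print(f"back to probability = {odds_to_probability(odds_six):.3f}")
```

Odds and probabilities are close to each other for rare outcomes but drift apart quickly as the outcome becomes more likely: a probability of 0.5 is odds of 1, and a probability of 0.8 is odds of 4.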

Odds Ratio is a measure of association between an exposure (or event) and an outcome. The odds ratio represents the odds that an outcome will occur given a certain event, compared to the odds of the outcome occurring in the absence of this event.

The video below provides you with a more detailed overview of what Odds and Odds Ratios are. 

Interpreting Odds Ratios: If you have an odds ratio of 1.4 you can say that there is a 40% increase in the odds given the exposure. An odds ratio of 2 means a 100% increase, or a doubling, of the odds. An odds ratio of 0.2, on the other hand, means a decrease of the odds by 80% given the exposure. An odds ratio of 1 means the exposure makes no difference to the odds.
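As an illustration, the sketch below computes an odds ratio from a simple two-by-two table and converts odds ratios into percentage changes in the odds. The counts (20/80 exposed, 10/90 not exposed) and the function names are hypothetical.

```python
def odds_ratio(exposed_yes, exposed_no, unexposed_yes, unexposed_no):
    """Odds of the outcome with the exposure divided by the odds without it."""
    odds_exposed = exposed_yes / exposed_no
    odds_unexposed = unexposed_yes / unexposed_no
    return odds_exposed / odds_unexposed


def percent_change_in_odds(ratio):
    """Express an odds ratio as a percentage increase (+) or decrease (-)."""
    return (ratio - 1) * 100


# Hypothetical counts: outcome present / absent, with and without exposure.
example_ratio = odds_ratio(exposed_yes=20, exposed_no=80,
                           unexposed_yes=10, unexposed_no=90)
print(f"odds ratio from the table = {example_ratio:.2f}")

for ratio in (1.4, 2.0, 0.2, example_ratio):
    print(f"odds ratio {ratio:.2f} -> {percent_change_in_odds(ratio):+.0f}% change in the odds")
```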

For a much more in-depth explanation of Odds and Log(Odds) watch this Stats Quest video. A follow up video on Odds Ratios is also worth watching. 

It is highly recommended that you watch the two Stats Quest videos mentioned above. 

Binomial / Multinomial Logistic Regression: Interpret the Outputs

In many ways interpreting Logistic Regressions is similar to interpreting the outcomes of a Linear Regression. 

Enter the world of Pseudo RSquared. It is more difficult to calculate the RSquared for Logistic Regressions. We therefore use something called the Pseudo RSquared instead. You report this in the same way you would report RSquared for a Linear Regression. Note, don't expect it to be as big as RSquared.

The most commonly used Pseudo RSquared is McFadden's. Here, anything between 0.2 and 0.4 indicates an excellent fit for your model.
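McFadden's Pseudo RSquared compares the log-likelihood of your model with that of a model containing only the intercept: 1 - (LL model / LL null). The sketch below shows the calculation for a binary outcome. The outcomes and predicted probabilities are invented for illustration; in practice they would come from your fitted Logistic Regression.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Log-likelihood of 0/1 outcomes given predicted probabilities of a 1."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(outcomes, probabilities))


def mcfadden_r_squared(outcomes, model_probabilities):
    """1 - LL(model) / LL(null), where the null model predicts the overall rate."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = log_likelihood(outcomes, model_probabilities)
    return 1 - ll_model / ll_null


# Hypothetical outcomes and model predictions.
outcomes = [1, 1, 1, 0, 0, 0, 1, 0]
model_probabilities = [0.8, 0.7, 0.9, 0.2, 0.3, 0.1, 0.6, 0.4]

pseudo_r2 = mcfadden_r_squared(outcomes, model_probabilities)
print(f"McFadden's Pseudo RSquared = {pseudo_r2:.3f}")
```

The made-up predictions here are far better than most real models manage, which is why the value is so much higher than the 0.2 to 0.4 range described above.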

The Estimate is the Log(odds) and Z is the Estimate divided by its Standard Error, so the value tells you how many standard errors the estimate is away from zero. The further the Z value is from zero (in either direction), the more likely it is to be significant.

Let's explore how to interpret the results of a Multinomial Regression (you interpret a Binomial Regression in the same way).

In the example below we explore the impact of Age and Gender on the level of Qualification.

We have set no Qualification and Male as the reference levels.

As you can see from the big table above, the overall model is significant (p-value <.001). The Pseudo RSquared is not great at 0.026.

Let's have a look at the third row - the likelihood of having a postgraduate degree versus not having a Qualification. Based on the p-values we see that there is a significant difference due to gender (p = 0.022) and age (p = 0.008). The odds ratio for this row indicates that the odds of having a postgraduate degree rather than no Qualification are 3.766 times as high for women as for men. Looking at the Confidence Intervals, we can be 95% sure that the true odds ratio lies somewhere between 1.211 and 11.715.
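To see how the Estimate, Z value, p-value and odds ratio fit together, the sketch below works backwards from the gender odds ratio and confidence interval reported above. It assumes the interval is a standard 95% Wald interval calculated on the Log(odds) scale, which is an assumption about how the software produced it, so treat the recovered Standard Error as approximate.

```python
import math

# Values reported in the output for gender (postgraduate v no Qualification).
odds_ratio = 3.766
ci_low, ci_high = 1.211, 11.715

# The Estimate is the log of the odds ratio.
estimate = math.log(odds_ratio)

# Assumed: CI = exp(estimate +/- 1.96 * SE), so the SE can be backed out.
std_error = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)

# Z is the Estimate divided by its Standard Error.
z = estimate / std_error

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Estimate (log odds) = {estimate:.3f}")
print(f"Standard Error      = {std_error:.3f}")
print(f"Z                   = {z:.2f}")
print(f"p-value             = {p_value:.3f}")
```

The recovered p-value agrees with the reported 0.022 to three decimal places, which suggests the assumption about the interval is reasonable for this row.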

In terms of age, we can see that as age goes up the likelihood of having a postgraduate degree (rather than no Qualification) goes down. The odds ratio is 0.943, so for each year you get older the odds of having a postgraduate degree are multiplied by 0.943, a decrease of about 5.7%. We can be 95% sure that the true odds ratio is somewhere between 0.910 and 0.981.
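Because an odds ratio for a continuous variable applies per unit, its effect compounds over several units: you multiply rather than add. The sketch below applies the age odds ratio of 0.943 over different age gaps. The ten and twenty year gaps are arbitrary choices for illustration.

```python
def odds_ratio_over(per_unit_odds_ratio, units):
    """Odds ratio for a change of several units of a continuous predictor."""
    return per_unit_odds_ratio ** units


age_odds_ratio = 0.943  # per year of age, as reported in the output

one_year = odds_ratio_over(age_odds_ratio, 1)
ten_years = odds_ratio_over(age_odds_ratio, 10)
twenty_years = odds_ratio_over(age_odds_ratio, 20)

for label, ratio in (("1 year", one_year), ("10 years", ten_years), ("20 years", twenty_years)):
    change = (ratio - 1) * 100
    print(f"{label:>8}: odds ratio {ratio:.3f} ({change:+.1f}% change in the odds)")
```

So a ten year age gap does not mean a 57% drop in the odds; the yearly decreases compound to a somewhat smaller reduction.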

Let's visually display the results.

Above the ages are:

Ordinal Logistic Regression: Interpret the Outcomes

Ordinal Logistic Regressions are interpreted similarly to Binomial/Multinomial Logistic Regressions. When reporting the output, you should also report the Thresholds (each threshold tells you when a score moves from one category to the next).

Using the thresholds you can predict/work out which category your scores would fall in. You can calculate it as follows:

(Predictor 1*Model Coefficient)+(Predictor 2*Model Coefficient) + ... etc

You then compare the outcome with the Threshold which will tell you where your prediction will fall. 
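The calculation described above can be sketched as follows. All of the coefficients, thresholds and category labels are hypothetical and only illustrate the mechanics; you would replace them with the values from your own Ordinal Logistic Regression output.

```python
from bisect import bisect_right


def predict_category(predictors, coefficients, thresholds, categories):
    """Sum predictor * coefficient, then find which thresholds the score passes."""
    score = sum(coefficients[name] * value for name, value in predictors.items())
    # bisect_right counts how many thresholds the score is at or above.
    return score, categories[bisect_right(thresholds, score)]


# Hypothetical model output (not from the tutorial dataset).
coefficients = {"age": 0.05, "female": -0.4}
thresholds = [0.5, 1.5, 2.5]   # one fewer than the number of categories
categories = ["Dislike", "Neutral", "Like", "Like a lot"]

# A 30 year old woman (female is a dummy variable: 1 = yes, 0 = no).
person = {"age": 30, "female": 1}
score, category = predict_category(person, coefficients, thresholds, categories)

print(f"score = {score:.2f} -> predicted category: {category}")
```

Here the score of 1.10 lies between the first and second thresholds, so the prediction falls in the second category.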

Note you can do this for one, some or all of the significant predictor variables. To read more follow this link.