Tutorial 10: Regression Modelling III

In this session, you will learn the following:

These techniques allow you to explore relationships between variables. Regressions are considered predictive tools which use statistical models. These models allow you to test your hypotheses and explore how different factors (independent variables) affect your outcome (dependent variable).

10.1 Odds Ratios

10.1.1 Odds and Odds Ratio

The concepts of probabilities and odds are fundamental in statistics, and while they are related, they describe different aspects of likelihood:

Probabilities, a re-cap:

Probabilities quantify the likelihood of an event happening, measured on a scale from 0 to 1, where 0 indicates impossibility and 1 indicates certainty. The probability of an event is calculated as the number of favourable outcomes divided by the total number of possible outcomes. For example, the probability of rolling a 3 on a standard six-sided die is 1/6, as there is one favourable outcome (rolling a 3) and six possible outcomes in total.

Odds

Odds, on the other hand, compare the likelihood of an event happening to the likelihood of it not happening. They are expressed as a ratio of two numbers: the number of favourable outcomes to the number of unfavourable outcomes. Using the dice example, the odds of rolling a 3 are 1 to 5, because there is one favourable outcome (rolling a 3) and five unfavourable outcomes (rolling a 1, 2, 4, 5, or 6).
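The dice example can be checked with a few lines of Python. This is only a minimal sketch: the function names probability_to_odds and odds_to_probability are made up for illustration, and exact fractions are used so that the result reads as "1 to 5" rather than a decimal.

```python
from fractions import Fraction


def probability_to_odds(p):
    """Odds = probability of the event / probability of no event."""
    return p / (1 - p)


def odds_to_probability(odds):
    """Reverse conversion: probability = odds / (1 + odds)."""
    return odds / (1 + odds)


# Probability of rolling a 3 on a six-sided die: 1 favourable outcome out of 6.
p_three = Fraction(1, 6)
odds_three = probability_to_odds(p_three)

print(odds_three)                       # 1/5, i.e. odds of 1 to 5
print(odds_to_probability(odds_three))  # 1/6, back to the probability
```

Note that the probability (1/6) and the odds (1/5) are close here only because the event is rare; for common events the two numbers diverge quickly.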

There are some key differences, these are: 

Odds Ratio

An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, often used in studies to compare the occurrence of outcomes across different groups. Let's apply this concept to a political science example, examining whether attending political protests is associated with a higher likelihood of voting in an election.

First, calculate the odds of voting for people who attend political protests versus those who do not. Suppose in a study of 100 protest attendees, 80 voted in the last election, giving odds of 80:20, or 4 (since 80 out of 100 voting means 20 did not, so the odds are 80 divided by 20). In a comparable group of 100 non-attendees, 50 voted, giving odds of 50:50, or 1. Now we can compare the results of the two groups using odds ratios (OR). The odds ratio is the ratio of these two odds. For our example, the OR is 4 / 1 = 4.
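The arithmetic in this example can be reproduced as follows. This is a small sketch using the counts given above; the helper name odds is an illustrative choice. The ratio of the two voting probabilities is printed as well, to show that it is a different number from the odds ratio.

```python
def odds(events, non_events):
    """Odds of an event: number who did / number who did not."""
    return events / non_events


# Counts from the example: 100 protest attendees and 100 non-attendees.
odds_attendees = odds(80, 20)      # 80 voted, 20 did not
odds_non_attendees = odds(50, 50)  # 50 voted, 50 did not

odds_ratio = odds_attendees / odds_non_attendees

# For comparison: the ratio of the voting probabilities in the two groups.
probability_ratio = (80 / 100) / (50 / 100)

print(odds_attendees)      # 4.0
print(odds_non_attendees)  # 1.0
print(odds_ratio)          # 4.0
print(round(probability_ratio, 2))  # 1.6
```

Attendees have four times the odds of voting, but only 1.6 times the probability, which is why odds ratios should not be read as "times more likely".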

Interpretation:

OR = 1: No difference in odds between the two groups. Protest attendance doesn't affect voting behaviour.

OR > 1: Higher odds in the first group. In our case, an OR of 4 means that individuals who attend political protests have four times the odds of voting compared to those who don't attend, indicating a strong association between protest attendance and voting.

OR < 1: Lower odds in the first group. If it were less than 1, it would suggest that attending a protest is associated with lower odds of voting, which isn't the case here.

In summary, the odds ratio in this context helps us understand the association between attending political protests and the likelihood of voting in an election. An OR greater than 1, as in our example, signifies a positive link between protest attendance and higher voting rates. Note that this is an association: on its own, it does not show that attending protests causes people to vote.

For a much more in-depth explanation of Odds and Log(Odds) watch this StatQuest video. A follow-up video on Odds Ratios is also worth watching.

It is highly recommended that you watch the two StatQuest videos mentioned above.

10.1.2 Interpreting Odds Ratios and Log Odds

Interpreting an odds ratio (OR) involves understanding the measure of association it provides between an exposure and an outcome in a study. An odds ratio quantifies how the odds of the outcome change with the exposure compared to without it. Here's how to interpret an odds ratio in various contexts:

Odds Ratio Greater Than 1

An OR greater than 1 indicates that the exposure is associated with higher odds of the outcome occurring. For example, an OR of 2 means that the exposed group has twice the odds of experiencing the outcome compared to the unexposed group. This can be interpreted as a 100% increase in the odds of the outcome occurring with the exposure.

Odds Ratio Less Than 1

An OR less than 1 suggests that the exposure is associated with lower odds of the outcome. For example, an OR of 0.5 implies that the exposed group has half the odds of experiencing the outcome compared to the unexposed group, or a 50% decrease in the odds.

Odds Ratio Equal to 1

An OR of 1 indicates no association between the exposure and the outcome. The odds of the outcome are the same in both the exposed and unexposed groups. This means that with an OR of 1 the exposure does not affect the odds of the outcome occurring.

The significance of an OR depends on the context of the study, including the magnitude of the effect and its relevance to the research question or clinical practice.

Always consider the confidence interval (CI) around an OR to assess the precision of the estimate. A narrow CI indicates a more precise estimate of the OR, while a wide CI suggests less certainty. If the CI includes 1, the data are compatible with no association and the OR is not statistically significant at the corresponding level; this is not the same as showing that there is no effect.

An OR does not directly translate to the risk or probability of the outcome. It measures odds, which can be less intuitive than probabilities, especially when the outcome is common.

In summary, interpreting an odds ratio involves understanding the direction and magnitude of the association it describes between an exposure and an outcome. An OR above 1 indicates increased odds, an OR below 1 indicates decreased odds and an OR of 1 indicates no change in odds due to the exposure.
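To make the role of the confidence interval concrete, here is a hedged sketch that computes an approximate 95% CI for the protest example from section 10.1.1. It assumes the common Wald (log-scale) approximation, where the standard error of log(OR) is the square root of the summed reciprocals of the four cell counts. This method is not taken from the tutorial itself, statistical software may use a different one, and the function name odds_ratio_with_ci is made up for illustration.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate CI from a 2x2 table.

    a, b: outcome yes / no in the first (exposed) group
    c, d: outcome yes / no in the second (unexposed) group
    z:    1.96 corresponds to an approximate 95% interval
    Assumes the Wald approximation on the log-odds scale.
    """
    odds_ratio = (a / b) / (c / d)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Protest example: 80 of 100 attendees voted, 50 of 100 non-attendees voted.
or_value, ci_lower, ci_upper = odds_ratio_with_ci(80, 20, 50, 50)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")

# The interval does not include 1, so the association is statistically
# significant at roughly the 5% level under this approximation.
print(ci_lower > 1)
```

With these counts the interval runs from roughly 2.1 to 7.5: the association is clearly positive, but its size is estimated with considerable uncertainty.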

_____________________________________________________________________________________

Interpreting log odds

Interpreting the log of odds, often encountered in logistic regression analysis, involves understanding how changes in predictor variables affect the odds of a certain outcome on a logarithmic scale. The log of odds is the natural logarithm of the odds. Taking the log puts the odds on a scale that runs from minus infinity to plus infinity and is symmetric around 0, which makes the relationship between the predictors and the outcome linear and easier to model statistically. Here’s a simplified explanation:

To convert log odds back to regular odds for easier interpretation, you would exponentiate the coefficient. This gives you the odds ratio, which can be more intuitively understood as the factor by which the odds multiply for a one-unit increase in the predictor variable.

In essence, interpreting the log of odds allows you to understand the direction and strength of the relationship between predictor variables and the outcome in a logistic regression model, with the transformation facilitating the modelling of non-linear relationships in a linear framework.
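The conversion between odds and log odds can be sketched in a few lines. The function names are illustrative only. The example values show why the log scale is convenient: odds of 4 and odds of 0.25 (4 to 1 for versus 4 to 1 against) become equal distances on either side of 0.

```python
import math


def odds_to_log_odds(odds):
    """Natural log of the odds (the logit scale)."""
    return math.log(odds)


def log_odds_to_odds(log_odds):
    """Exponentiate to get back to ordinary odds."""
    return math.exp(log_odds)


for odds_value in (0.25, 1.0, 4.0):
    log_odds = odds_to_log_odds(odds_value)
    # Converting back recovers the original odds.
    recovered = log_odds_to_odds(log_odds)
    print(odds_value, round(log_odds, 3), round(recovered, 2))
```

Odds of 1 (an even chance) sit at a log odds of 0; positive log odds mean the event is more likely than not, negative log odds mean it is less likely than not.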

10.2 Logistic Regression

So far we have only focused on using continuous dependent variables in regression modelling. Logistic regressions allow us to use nominal and ordinal dependent variables. Logistic regression is another extension of the linear regression discussed above. There are three types of logistic regression, each covered below: binomial, multinomial and ordinal.

It is worth watching the videos below, which provide you with an in-depth introduction to the concept of Logistic Regression. For a shorter and less extensive introduction watch this clip here.

It is also worth watching 'Logistic Regression Understanding its Coefficients' and 'Logistic Regressions RSquared', which will help you learn how to interpret the outcomes of a Logistic Regression.


10.2.1 Binomial Logistic Regression

Logistic regression is a way to explore the relationship between a series of factors (like age, income, and education) and a particular outcome that has two possibilities (like voting for candidate A or not voting for candidate A). The tricky part is that this outcome is like an on-off switch (0 or 1), which doesn't fit well with regular straight-line (linear) predictions.

To handle this, logistic regression does not predict the on-off switch directly. Instead, it models the probability of the switch being on, which lies between 0 and 1, and converts that probability into log odds, which can take any value. On the log-odds scale the relationship with the factors can be modelled as a straight line; converted back into probabilities, that straight line becomes an S-shaped curve that never goes below 0 or above 1. In essence, it transforms the prediction problem into figuring out the odds of the switch being on (1) rather than off (0), based on the factors we're looking at. By using logs, logistic regression can make predictions and show how different factors increase or decrease the likelihood of the outcome happening.
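The link between the straight line on the log-odds scale and the S-shaped curve on the probability scale can be illustrated as follows. This is only a sketch: the intercept and slope are hypothetical numbers chosen for illustration and do not come from any model in this tutorial.

```python
import math


def logistic(log_odds):
    """Convert log odds into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical model: log odds of the outcome = intercept + slope * x.
intercept = -3.0
slope = 0.1

for x in (0, 15, 30, 45, 60):
    log_odds = intercept + slope * x    # straight line on the log-odds scale
    probability = logistic(log_odds)    # S-shaped curve on the probability scale
    print(x, round(log_odds, 1), round(probability, 3))
```

Each step of 15 in x adds the same amount (1.5) to the log odds, but the change in probability is largest near 0.5 and flattens out towards 0 and 1.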

The output of a Logistic Regression looks very similar to that of a Linear Regression. There are, however, some significant differences. 

In logistic regression, the coefficients are expressed in log odds. Each coefficient tells us how much the log odds of the outcome change with a one-unit increase in the predictor variable, holding the other predictors constant. Log odds can be a bit tricky to interpret directly. When you exponentiate a coefficient, you get the odds ratio, which is easier to understand: it shows the factor by which the odds of the outcome increase (if the odds ratio is greater than 1) or decrease (if it's less than 1) for a one-unit increase in the predictor. Most statistical software will also provide you with the OR, which is easier to interpret.

Because logistic regressions don't have an equivalent to the R-squared statistic from linear regression (which measures how well the model explains the variability of the data), we often use a Pseudo R-squared measure instead. Pseudo R-squared values provide a way to gauge the model's explanatory power, but they don't have a direct interpretation like the R-squared in linear regression. There are several types of Pseudo R-squared, with Nagelkerke’s R-squared being one popular option. These measures give us a rough idea of how well our model fits the data, but they should be interpreted with caution and understood as not directly comparable to the R-squared from linear regression.

Interpreting Pseudo R Squared

There are several different Pseudo R-Squared measures; below are two commonly used ones. Pseudo R-Squared values typically range from 0 to just under 1. However, the values are generally much lower than what you would expect from a linear regression R-squared. A value of 0 indicates that the model has no explanatory power, and values closer to 1 indicate a better fit.

McFadden's R-Squared

Nagelkerke’s R-Squared

Key Points for Both

In summary, McFadden's and Nagelkerke’s R-Squared values provide useful, albeit rough, measures of model fit in logistic regression, with Nagelkerke’s adjustment offering a scale that is perhaps easier to interpret for those familiar with linear regression. However, their interpretation should be contextual and cautious, particularly when comparing across different types of models.
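For readers who want to see where these numbers come from, the sketch below computes both measures from a model's log-likelihoods using their standard textbook definitions (these formulas are not given in this tutorial). The log-likelihood values and the sample size are hypothetical, chosen only so that McFadden's value comes out at 0.21; in practice your statistical software reports these quantities for you.

```python
import math


def mcfadden_r2(ll_model, ll_null):
    """McFadden: 1 - (log-likelihood of model / log-likelihood of null model)."""
    return 1 - ll_model / ll_null


def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke: Cox and Snell R-squared rescaled so its maximum is 1."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    maximum = 1 - math.exp(2 * ll_null / n)
    return cox_snell / maximum


# Hypothetical values for illustration only.
ll_null = -100.0   # intercept-only model
ll_model = -79.0   # model with predictors
n = 150            # number of observations

mcfadden = mcfadden_r2(ll_model, ll_null)
nagelkerke = nagelkerke_r2(ll_model, ll_null, n)

print(round(mcfadden, 2))
print(round(nagelkerke, 2))
```

The same model yields different values under the two measures, which is one reason Pseudo R-Squared values should only be compared within the same measure.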

Above you see the typical output of a Binomial Logistic Regression. You can of course also create additional visualisations, Marginal Means plots and tables etc. 

For example, McFadden's R-Squared (0.21) indicates that the model might be an acceptable fit for the data.

Interpreting the Output

The Intercept

The intercept (−0.83) represents the log odds of being in the "Extreme" category when all predictors are at zero or at their reference category. The reported value of 0.43 is the exponentiated intercept, which is the baseline odds rather than a ratio of two odds. Because it is below 1, being "Extreme" is less likely than not being "Extreme" at the baseline levels of the predictors.

Predictor Coefficients

Each predictor’s estimate represents the change in the log odds of being "Extreme" for a one-unit increase in that predictor, holding all other predictors constant.

Impact_Globalisation: A coefficient of −0.25 indicates that as the impact of globalisation increases by one unit, the log odds of being "Extreme" decrease, with an odds ratio of 0.78 suggesting a decrease in odds.

Age: For each one-year increase in age, the log odds of being "Extreme" decrease by 0.03, with the odds ratio of 0.97 indicating a slight decrease in odds.

Gender (Female as the reference group): Being male (compared to female) increases the log odds of being "Extreme" by 1.07, translating to an odds ratio of 2.91, suggesting that males have higher odds of being "Extreme" than females.

Political_LeaningRight_Left: A coefficient of −0.35 means that moving from right to left in political leaning decreases the log odds of being "Extreme", with an odds ratio of 0.71 indicating reduced odds.
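The odds ratios quoted above can be recovered by exponentiating the coefficients, as this short sketch shows. The coefficient values are the rounded ones reported in the text, so the results can differ from the reported odds ratios by about 0.01 (the software exponentiates the unrounded coefficients). The dictionary keys are just labels for this example.

```python
import math

# Log-odds coefficients as quoted in the text (rounded to two decimals).
coefficients = {
    "Impact_Globalisation": -0.25,
    "Age": -0.03,
    "Gender (Male - Female)": 1.07,
    "Political_LeaningRight_Left": -0.35,
}

# Exponentiating a log-odds coefficient gives the odds ratio.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    # (OR - 1) * 100 is the percentage change in the odds per one-unit increase.
    percent_change = (odds_ratio - 1) * 100
    print(f"{name}: OR = {odds_ratio:.2f} ({percent_change:+.0f}% change in odds)")
```

For example, the Age coefficient of −0.03 corresponds to an odds ratio of about 0.97, i.e. roughly a 3% decrease in the odds of being "Extreme" for each additional year of age.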

Significance (p-values)

Odds Ratios

Strains and Social Factors

Has Degree Yes/No

Having a degree (Yes – No) with a coefficient of −0.39 and an odds ratio of 0.68 suggests that having a degree slightly decreases the odds of being "Extreme", though this effect is not statistically significant (p = 0.105).

This table offers a comprehensive view of how various factors are associated with the likelihood of being categorised as "Extreme", combining individual, social, and political predictors to provide insights into the factors associated with extreme responses.

Below we can see a visual representation of Age, Gender and Political Leaning. 

Binomial Logistic Regression:

Your dependent variable must be binomial (two levels, e.g. Male/Female). Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression. 

10.2.2 Multinomial Logistic Regression

Multinomial logistic regression is a statistical method used when you want to predict outcomes that fall into three or more categories, which are not ordered. For example, if you're studying voters' preferences for different political parties (e.g., Party A, Party B, Party C), multinomial logistic regression can help you understand how different factors (like age, income, or education) influence someone's likelihood of preferring one party over the others. 

Unlike binary logistic regression, which deals with outcomes that have two possible states (yes/no, win/lose), multinomial logistic regression handles multiple categories by comparing the log odds of being in one category to the log odds of being in a baseline category, for each predictor variable. This allows you to model and predict more complex relationships where the outcome isn't just a simple yes or no but includes several distinct options.
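The baseline-category idea can be sketched as follows, using the party example. The log-odds values are hypothetical and do not come from any fitted model; the function name category_probabilities is also just an illustrative choice. The baseline category (here Party A) has log odds of 0 by definition, and the other categories are expressed relative to it.

```python
import math


def category_probabilities(log_odds_vs_baseline, baseline="Party A"):
    """Turn log odds relative to a baseline category into probabilities."""
    exponentiated = {k: math.exp(v) for k, v in log_odds_vs_baseline.items()}
    # The baseline contributes exp(0) = 1 to the denominator.
    denominator = 1 + sum(exponentiated.values())
    probabilities = {baseline: 1 / denominator}
    for category, value in exponentiated.items():
        probabilities[category] = value / denominator
    return probabilities


# Hypothetical log odds of preferring each party rather than Party A
# for one particular voter profile.
log_odds = {"Party B": 0.5, "Party C": -1.0}

probabilities = category_probabilities(log_odds)
for party, p in probabilities.items():
    print(party, round(p, 3))
```

The probabilities always sum to 1, and the ratio of any category's probability to the baseline's probability equals the exponentiated log odds for that category.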

The coefficients are generally interpreted in the same way as in a binomial logistic regression. The big difference is that the output provides a separate breakdown for each level of the dependent variable, compared with the reference level.

Here the table gives the output for each level compared with the reference level, which is Strongly Agree.

Again you can visualise the output using Marginal Means plots and tables. 

Your dependent variable must have three or more levels (e.g. Green/Red/Blue). These levels can be nominal or ordinal. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.

10.2.3 Ordinal Logistic Regression

In ordinal logistic regression, which is used for predicting an ordinal dependent variable (a variable with categories that have a natural order, but no consistent interval between them), model thresholds (or cut points) play a crucial role. These thresholds separate the categories of the dependent variable based on the predicted probabilities.

Understanding Model Thresholds

Model thresholds are values derived from the model that define the boundaries between the ordinal categories. For an ordinal outcome with k categories, there will be k-1 thresholds. These thresholds are estimated during the model fitting process and are used to determine the predicted category for each observation based on its predicted probabilities.

Interpreting Model Thresholds

Model thresholds in ordinal logistic regression offer a mechanistic insight into how the model discriminates between the ordered categories based on the predictor variables. Understanding and interpreting these thresholds help in grasping the underlying dynamics of the model predictions and the influence of predictor variables on the ordinal outcome.

This output from an ordinal logistic regression model examines factors that might influence people's agreement with the statement "Share_nothing_Society". The dependent variable has five ordered categories: Strongly Agree, Somewhat Agree, Neither Agree nor Disagree, Somewhat Disagree and Strongly Disagree.

Model Fit Measures

Deviance: A measure of model fit; lower values indicate better fit.

AIC (Akaike Information Criterion): Another fit index, where lower values suggest a better model.

R² McF (McFadden’s R-Squared): A pseudo R-Squared value that gives a rough indication of the model's explanatory power; it is not a literal proportion of variance explained. At 0.02, it suggests the model has very little explanatory power.

Model Coefficients

The coefficients indicate how each predictor variable is expected to affect the log odds of being in a higher versus lower category of agreement with the "Share_nothing_Society" statement.

Positive Estimate: Predicts higher odds of agreeing more with the statement. For instance, "Impact_Globalisation" and "Racism_scale" have positive estimates and significant p-values, indicating they're associated with greater odds of stronger agreement.

Negative Estimate: Predicts higher odds of disagreeing more with the statement. "Extremism_Score_scaled" and "Gender: Female - Male" have negative estimates, suggesting they're associated with greater odds of stronger disagreement. However, their p-values are not significant, so the data provide no clear evidence of a relationship.

Odds Ratios (OR): Values above 1 indicate higher odds of falling into a higher agreement category with a one-unit increase in the predictor. OR below 1 suggests higher odds of falling into a lower agreement category. For instance, "Age" has an OR close to 1, suggesting a minimal impact on agreement level.

Model Thresholds

The model thresholds in an ordinal logistic regression are the points that separate the different levels of the outcome variable. In this case, the outcome variable is the level of agreement with the "Share_nothing_Society" statement. Here's how to interpret the thresholds:

Threshold Estimates: Each threshold estimate represents the log odds of being at or below a certain level of agreement versus being above it, given that all predictors are at zero. These are the points along the logit (log-odds) scale where being at or below that level and being above it are equally likely (a 50/50 chance).

Significance: The Z-scores and associated p-values for each threshold estimate tell us whether the threshold is significantly different from zero. In this table, all p-values are below .05, indicating that each threshold significantly separates the levels of agreement.

In practical terms, these thresholds can help us understand where the 'cut-off' points are between different response categories and how certain the model is about these distinctions. They are particularly useful for making predictions about where new observations might fall in the ordered outcome categories based on their predictor values.
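The way thresholds carve the log-odds scale into categories can be sketched as below. The threshold values are hypothetical, not those from the output discussed here, and the sketch assumes the parameterisation in which the probability of being at or below category j is logistic(threshold_j − linear predictor). Some software uses the opposite sign for the linear predictor, so check the documentation of your package.

```python
import math


def logistic(x):
    """Convert log odds into a probability."""
    return 1 / (1 + math.exp(-x))


def ordinal_probabilities(thresholds, linear_predictor=0.0):
    """Category probabilities from k-1 ordered thresholds.

    Assumes P(category <= j) = logistic(threshold_j - linear_predictor).
    """
    cumulative = [logistic(t - linear_predictor) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        # Each category's probability is the gap between cumulative values.
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical thresholds for an outcome with five ordered categories.
thresholds = [-2.0, -0.5, 0.5, 2.0]

at_zero = ordinal_probabilities(thresholds)                        # all predictors at zero
shifted = ordinal_probabilities(thresholds, linear_predictor=1.0)  # higher linear predictor

print([round(p, 3) for p in at_zero])
print([round(p, 3) for p in shifted])
```

Four thresholds produce five category probabilities that sum to 1. Under this parameterisation, raising the linear predictor moves probability away from the lowest category and towards the highest one, while the thresholds themselves stay fixed.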

Summary

The model suggests that variables like the perceived impact of globalisation and experienced racism are significantly associated with stronger agreement with the "Share_nothing_Society" statement, while extremism and being female are less clear in their impact. The thresholds show where the odds shift in favour of one category over another. The overall model has a low explanatory power, indicating that other unmeasured factors may also influence individuals' levels of agreement with the statement.

Your dependent variable must be an ordinal variable. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression. 

10.3 Additional Learning Materials