Tutorial 9: Regression Modelling II

In this session, you will learn about the assumptions of linear regression, how to check each one, and what to do when an assumption is not met.

Always make sure you check your assumptions. Some assumptions are more important than others. If the assumptions are not met, then consider other models such as logistic regression (covered in the next tutorial).

Remember, if the assumptions are not met, it is possible that your output will not explain your data - rubbish in, rubbish out. 

9.1 Linear Regression Assumptions

9.1.1 Why we check for assumptions

Checking the assumptions of a statistical model, such as linear regression, is a critical step in any statistical analysis. It ensures that the model is appropriate for the data and that the conclusions drawn from the analysis are accurate, reliable, and valid.


When Assumptions Are Not Met

If the assumptions are not met, consider the following options:

Watch an in-depth explanation of why we test for assumptions and what we do if they are not met here. A much shorter introduction to assumption testing is given below.

9.1.2 Assumption 1: Linearity

The assumption of linearity is fundamental in many statistical models, particularly in linear regression, which assumes a straight-line relationship between the independent and dependent variables. This assumption implies that a change in an independent variable produces a consistent change in the dependent variable, regardless of the value of the independent variable. Meeting this assumption is crucial for the model to accurately represent the relationship between variables and for the validity of inferences drawn from the model.

Checking for Linearity

Linearity can be assessed through several methods:
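One common approach is a residual check: fit a straight line and look for systematic patterns in the residuals. Below is a minimal, plot-free sketch using NumPy; the simulated data is purely illustrative.

```python
import numpy as np

# Simulated data with a clearly curved (quadratic) relationship.
x = np.linspace(0, 10, 30)
y = (x - 5) ** 2  # U-shaped: a straight line cannot capture this

# Fit a straight line and compute residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Under linearity, residuals should hover around zero everywhere.
# Here the middle of the x-range sits below the line and the ends above it,
# the classic signature of unmodelled curvature.
middle = residuals[10:20].mean()
ends = np.concatenate([residuals[:10], residuals[20:]]).mean()
print(middle < 0 < ends)  # True: a systematic pattern, so linearity is suspect
```

In practice you would plot the residuals against the fitted values; a clear arch or U-shape in that plot is the visual version of the same diagnostic.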

Addressing Linearity Issues

When the linearity assumption is violated, you can use several strategies:

Important Considerations

Transformations and the addition of polynomial terms can make interpreting model coefficients more complex, so it is essential to understand how these changes affect the relationship between variables.

Adding polynomial terms or segmenting the data also increases the model's complexity. It is crucial to balance the need to capture non-linearity against the risk of overfitting, where the model fits the training data too closely and performs poorly on new data.

Ultimately, the choice between transforming variables, adding polynomial terms, or switching to a non-linear model depends on the specific research question, the nature of the data, and the interpretability of the model.

While the assumption of linearity simplifies modelling and interpretation, real-world data often exhibit complex relationships that require careful assessment and potentially sophisticated modelling strategies to accurately capture.

9.1.3 Assumption 2: Independence

The assumption of independence stipulates that the observations (data points) are independent of each other; that is, the value of one observation does not influence or predict the value of another. This assumption is foundational for the validity of statistical tests, as many inferential statistics rely on the independence of observations to produce accurate standard errors, confidence intervals, and significance tests.

How to Check for Independence
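When observations have a natural order (for example, over time), one simple numeric screen is the lag-1 autocorrelation of the residuals. A sketch with NumPy, using simulated residuals as a stand-in for those from a fitted model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Residuals from a well-specified model on independent observations
# should show no correlation with their own lagged values.
residuals = rng.normal(size=200)

# Lag-1 autocorrelation: correlation between each residual and the next one.
r1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(round(r1, 3))  # close to 0 for independent residuals
```

A substantial lag-1 correlation is one symptom of dependence; clustered or nested data (e.g., repeated measures per subject) require thinking about the study design as well, since dependence there will not always show up in a simple ordering check.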

Addressing Independence Issues

Violations of the independence assumption can often be addressed through modifications to the model or the data. Generalized Estimating Equations can be used for correlated data, often found in longitudinal studies, where observations from the same subject are correlated.

Use a random effects or multi-level (hierarchical) model. These models are particularly useful for data where observations are nested within higher-level groups (e.g., students within schools). By introducing random effects, they account for the lack of independence within groups while still assuming independence between groups, making them especially well suited to data with a nested or hierarchical structure.

Alternatively, use fixed effects models for panel data to control for unobserved heterogeneity when this heterogeneity is constant over time and specific to individuals or entities, thus addressing potential non-independence.

Ensuring the assumption of independence is crucial for the integrity of your statistical analysis. When this assumption is violated, it means the standard errors, p-values, and confidence intervals may not be reliable. Using appropriate models like multi-level regression or adjusting the analysis technique can help mitigate these issues and lead to more accurate and trustworthy conclusions.

9.1.4 Assumption 3: Homoscedasticity

The assumption of homoscedasticity, or equal variance, is pivotal in linear regression analysis and various other statistical modelling techniques. It posits that the variance of the error terms (residuals) is constant across all levels of the independent variables. In simpler terms, as the value of the predictor variables changes, the spread (variability) of the residuals remains consistent.

Importance of Homoscedasticity

Homoscedasticity ensures that the model estimates the dependent variable with uniform precision across the range of independent variables. It's crucial for several reasons:

Checking for Homoscedasticity

Homoscedasticity can be assessed visually and through statistical tests:
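As a crude numeric version of the visual check, you can compare the spread of the residuals across the range of the fitted values. A NumPy sketch on simulated heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate heteroscedastic data: the noise grows with x.
x = np.linspace(1, 10, 200)
y = 2 * x + rng.normal(scale=0.5 * x)  # error spread increases with x

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread in the lower vs upper half of the x-range.
# Roughly equal spreads are consistent with homoscedasticity;
# a large ratio suggests heteroscedasticity.
low_spread = residuals[:100].std()
high_spread = residuals[100:].std()
print(round(high_spread / low_spread, 2))  # clearly above 1 here
```

The same idea underlies formal tests such as Breusch-Pagan, which regress the squared residuals on the predictors to test whether the error variance depends on them.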

Interpreting the Residual Plot:

The residuals are the differences between the observed values and the values predicted by the model. Analyzing these differences can provide insights into the adequacy of the model, the presence of outliers, and whether the assumptions of linear regression are being met (e.g., linearity, homoscedasticity, independence, and normality of residuals).

A fitted residual plot is a specific type of residual plot that plots the residuals on the y-axis against the fitted values (or predicted values) on the x-axis. This plot can help assess:

While the fitted residual plot is a powerful diagnostic tool, it's usually not enough on its own. It is best used in conjunction with other diagnostic plots and statistics, such as:

Together, these diagnostics give a comprehensive view of the model's performance and assumptions. It's crucial to use multiple diagnostics because each highlights different aspects of the data and model fitting process.


Addressing Heteroscedasticity

When homoscedasticity is violated, there are several approaches to address the issue:

The assumption of homoscedasticity is integral to many statistical analyses for ensuring the accuracy and validity of the model's inferences. Identifying and addressing heteroscedasticity is crucial for the reliability of the conclusions drawn from statistical models.

9.1.5 Assumption 4: Normality of Residuals

The normality assumption posits that the residuals — the differences between the observed values and the values predicted by the model — are normally distributed for any given value of the independent variables.

Importance of Normality of Residuals

The normality of residuals is crucial for several reasons:

Checking the Normality of Residuals

To assess whether the residuals from a regression model are normally distributed, the following methods are commonly used:

Graphical Methods:

Statistical Tests:

Interpreting Q-Q Plots

Q-Q plots are graphical tools used to assess if a dataset follows a certain theoretical distribution, usually the normal distribution. When interpreting Q-Q plots, the main thing to look for is how well the points follow the straight line drawn on the plot. If the points lie on or very close to the line, it suggests that the data conforms well to the theoretical distribution. Deviations from this line indicate departures from the expected distribution: for example, points curving away from the line at both ends suggest heavier (or lighter) tails than the normal distribution, while an S-shaped pattern suggests skewness.

By assessing how closely the data points follow this reference line, Q-Q plots provide a visual means to evaluate the distribution assumptions underlying many statistical tests and models, guiding further analysis decisions.
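As a sketch of both the graphical and test-based approaches (assuming SciPy is available): `scipy.stats.probplot` computes the quantile pairs behind a Q-Q plot, and `scipy.stats.shapiro` runs the Shapiro-Wilk test. The simulated residuals stand in for those of a fitted model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=150)  # stand-in for model residuals

# Q-Q plot data: probplot returns the theoretical and ordered sample
# quantiles (pass plot=plt.gca() to draw the plot with matplotlib).
(theoretical_q, sample_q), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))  # r near 1 means the points hug the reference line

# Shapiro-Wilk test: a small p-value (e.g. < 0.05) is evidence against
# normality; for genuinely normal residuals it is usually much larger.
stat, p_value = stats.shapiro(residuals)
print(round(stat, 3), round(p_value, 3))
```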

Addressing Non-normality of Residuals

When residuals do not appear to be normally distributed, several strategies can be employed:

While the assumption of the normality of residuals is less critical for large sample sizes due to the Central Limit Theorem, which suggests that the sampling distribution of the estimate becomes approximately normal with sufficient sample size, it remains a vital diagnostic tool. Assessing and addressing the normality of residuals ensures the integrity of statistical inferences drawn from the model, enhancing the credibility and reliability of the findings.

9.1.6 Assumption 5: No multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning that one variable can be linearly predicted from the others with a substantial degree of accuracy. 

Importance of No Multicollinearity

The presence of multicollinearity can significantly impact the regression analysis in several ways:

Checking for Multicollinearity

To detect multicollinearity in a regression model, the following methods are commonly employed:
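A standard detection method is the variance inflation factor (VIF): for each predictor, regress it on the remaining predictors and compute 1 / (1 - R²). Values above roughly 5-10 are commonly taken to signal problematic multicollinearity. A self-contained NumPy sketch with simulated predictors:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two highly correlated predictors plus one independent predictor.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(3)]
print([round(v, 1) for v in vifs])  # first two far above the usual cut-off
```

In practice most statistics packages report VIFs directly; the point of the hand computation is to show that a VIF is nothing more than a reformulated R² from regressing one predictor on the others.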

Addressing Multicollinearity

When multicollinearity is detected, several strategies can be considered to mitigate its impact:

The assumption of no multicollinearity is critical for ensuring the reliability and interpretability of the regression coefficients in a multiple regression model. By detecting and addressing multicollinearity, researchers can make more confident inferences about the relationships between independent variables and the dependent variable, enhancing the model's overall validity and usefulness.

9.1.7 Assumption 6: No Autocorrelation

We won't say much about this here. This is only an issue when you have time-series data. Autocorrelation, also known as serial correlation, refers to the situation where residuals (errors) from a regression model are not independent but instead exhibit a pattern or correlation over time or sequence. This assumption violation is particularly relevant in time series analysis.

Checking Autocorrelation
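A standard check is the Durbin-Watson statistic, which is near 2 when residuals are uncorrelated, falls toward 0 under positive autocorrelation, and rises toward 4 under negative autocorrelation. A NumPy sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    e = np.asarray(residuals)
    return (np.diff(e) ** 2).sum() / (e ** 2).sum()

rng = np.random.default_rng(3)

independent = rng.normal(size=300)
print(round(durbin_watson(independent), 2))  # close to 2

# Strongly positively autocorrelated residuals (an AR(1) process).
ar = np.zeros(300)
for t in range(1, 300):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
print(round(durbin_watson(ar), 2))  # well below 2
```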

Addressing Autocorrelation

Model Adjustments: Incorporating lagged terms of the dependent variable or residuals can account for the autocorrelation.

Alternative Models: Time series models like ARIMA are specifically designed to handle autocorrelation.

Briefly, detecting and correcting autocorrelation is crucial for ensuring the reliability of regression analysis, especially for time series data, where temporal patterns can significantly influence model accuracy and inference.

9.1.8 Check for outliers using Cook's Distance

Cook's Distance is a measure used in regression analysis to identify influential observations. It specifically relates to the potential impact of each observation on the estimated regression coefficients. An observation with a high Cook's Distance can indicate that it is an outlier or has high leverage, meaning it significantly influences the model's fit. 

While Cook's Distance does not correspond directly to one of the classical assumptions of linear regression (linearity, independence, homoscedasticity, normality of residuals, no multicollinearity, and no autocorrelation), it is a valuable complement to assumption checking because it identifies data points that disproportionately influence the model's parameters. Essentially, it diagnoses potential problems with model fit and robustness rather than testing a specific assumption.

Influential observations identified by Cook's Distance might indicate a violation of the **homoscedasticity** assumption if these points are also outliers in the y dimension, or they might relate to the **linearity** assumption if removing them significantly changes the estimated relationship between variables. Thus, while Cook's Distance is not used to check a specific assumption directly, it is a valuable diagnostic tool for identifying data points that could lead to violations of the other linear regression assumptions.


General Guidelines for Cook's Distance

Low Cook's Distance: A low value suggests that the observation has little to no influence on the regression model's coefficients. Typically, values of Cook's Distance below 0.5 are considered low, indicating that removing the observation would not significantly alter the model.

Moderate Cook's Distance: Values between 0.5 and 1 can be considered moderate. Observations falling in this range may warrant closer examination, as they could start to influence the model's predictions.

High Cook's Distance: Observations with Cook's Distance values greater than 1 are often considered highly influential. These points have a substantial impact on the model and could potentially skew the results. A common, more conservative threshold used to flag potentially influential points is 4/n, where n is the total number of observations in the dataset. This threshold adjusts for the size of the dataset, making it particularly useful for datasets of varying sizes.
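The guidelines above can be applied directly. Below is a NumPy sketch that computes Cook's Distance by hand for a simple linear model with one deliberately planted outlier; in practice your statistics package will report these values, but the hand computation shows what they measure.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simple linear model with one deliberately influential observation.
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(scale=1.0, size=50)
y[-1] += 30  # turn the last observation into a gross outlier

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

n, p = X.shape
s2 = (residuals ** 2).sum() / (n - p)              # residual variance estimate
H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat (projection) matrix
h = np.diag(H)                                     # leverage of each observation

# Cook's Distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
cooks_d = (residuals ** 2 / (p * s2)) * (h / (1 - h) ** 2)

print(int(cooks_d.argmax()))   # 49: the planted outlier
print(cooks_d.max() > 4 / n)   # True: well above the 4/n threshold
```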

Interpreting Cook's Distance

Interpreting Cook's Distance involves both quantitative assessment using these guidelines and qualitative judgment about the data and the context of the study. It's essential to:

In summary, while values of Cook's Distance greater than 1 are commonly flagged as potentially influential, the decision on what constitutes low or high should be informed by the specific context of your analysis, the distribution of Cook's Distance values across your data, and the size of your dataset.

9.2 Log and Quadratic Transformations

9.2.1 Log Transformation

Log transformation is a powerful mathematical technique widely used in statistical analysis to stabilize variance, normalize distributions, and make patterns in data more interpretable. It involves applying the logarithm function to a dataset, transforming each value x into log(x). This transformation can be particularly beneficial in various scenarios:

Addressing Skewness

Log transformation is especially useful for handling right-skewed data (where the tail on the right side of the distribution is longer or fatter than the left). By compressing the long tail and stretching out the left side of the distribution, log transformation can help normalize data, making it more symmetrical.

Stabilizing Variance

In regression analysis, heteroscedasticity (non-constant variance) can violate model assumptions, affecting the reliability of statistical tests. Log transformation can help stabilize the variance of residuals, ensuring that they're more uniform across the range of values, which is essential for meeting the homoscedasticity assumption.

Linearizing Relationships

Some relationships between variables may be multiplicative or exponential in nature, making it challenging to model them with linear methods. Applying a log transformation to one or both variables can linearize such relationships, allowing for a more straightforward analysis with linear models.

Interpretation Changes

After a log transformation, the interpretation of the coefficients in regression models changes. For example, in a model where the dependent variable is log-transformed, a one-unit change in an independent variable is associated with a percentage change in the dependent variable (approximately 100 x b percent for a small coefficient b, or exactly (e^b - 1) x 100 percent), rather than a unit change.
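A small worked example of this interpretation, using noise-free simulated data so the recovered percentage is exact:

```python
import numpy as np

# Noise-free illustration: y grows by exactly 5% for each one-unit increase in x.
x = np.arange(0, 20, dtype=float)
y = 2.0 * 1.05 ** x

# Regress log(y) on x; the slope b satisfies exp(b) = 1.05.
b, a = np.polyfit(x, np.log(y), deg=1)
pct_change = (np.exp(b) - 1) * 100
print(round(pct_change, 2))  # 5.0: each unit of x raises y by 5%
```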


Log transformation can be applied to a single variable, several variables, or even the entire dataset, depending on the analysis needs. Common bases for the logarithm include the natural logarithm (base e), base 10, and base 2; the choice of base does not affect the model fit, only the scale on which coefficients are interpreted.
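A quick NumPy sketch showing how logging reduces right skew, and that the choice of base does not change the shape of the distribution (the data is simulated and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

# Right-skewed data (log-normal): a few large values stretch the right tail.
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

def skewness(a):
    """Sample skewness: third standardized moment (0 for symmetric data)."""
    a = np.asarray(a)
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

# np.log is base e; np.log10 and np.log2 are the other common choices.
logged = np.log(data)

print(round(skewness(data), 2))    # strongly positive (right-skewed)
print(round(skewness(logged), 2))  # near 0 (roughly symmetric)

# The base only rescales the values; it does not change the shape:
print(np.allclose(skewness(np.log2(data)), skewness(logged)))  # True
```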

Practical Considerations

Log transformation is a versatile tool in data preprocessing and analysis, enhancing the appropriateness of statistical models and the clarity of data patterns. It's a valuable technique for analysts and researchers dealing with non-normal distributions, heteroscedasticity, or non-linear relationships.

9.2.2 Quadratic Transformation

At this point, it is worth pointing out that the term linear refers not to the shape of the fitted line but to the model being linear in its parameters (the coefficients). This means you can model non-linear relationships between variables - although these will be more difficult to interpret.

Quadratic Transformation:

Using a quadratic transformation in a dataset for linear regression is a strategy to address situations where the relationship between the independent variable x and the dependent variable y isn't a straight line but rather has a curve. Here’s how it’s typically done, without diving into the math:

First, you'd look at your data, often by plotting x against y, to see if the relationship between them looks curved or has a peak or trough, suggesting a quadratic relationship might be a better fit than a straight line.

You transform the independent variable by squaring it, essentially creating a new variable by multiplying the variable by itself. This doesn't mean you alter your original data; you just add extra information (the squared term) to use in the analysis. In your linear regression model, you include both the original independent variable x and its squared version x*x. This allows the model to account for the curve by bending the line to fit the data points better.

Finally, you analyze the results. Examine whether the model with the quadratic term fits the data better than the simple linear model; you are looking for a more accurate representation of how x predicts y.
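The steps above can be sketched end-to-end with NumPy on simulated data; here `np.polyfit` with `deg=2` plays the role of adding the squared term, and R-squared is used to compare the fits:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with a clear peak: a straight line is a poor description.
x = np.linspace(0, 10, 100)
y = -(x - 5) ** 2 + 25 + rng.normal(scale=1.0, size=100)

def r_squared(y, fitted):
    return 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Straight-line fit vs a fit that also includes the squared term.
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

print(round(r_squared(y, linear_fit), 3))     # near 0: the line explains little
print(round(r_squared(y, quadratic_fit), 3))  # near 1: the curve fits well
```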

Interpreting the output: 

The presence of the quadratic term in the model tells you that the relationship between x and y changes at different values of x - it might increase initially and then decrease, or vice versa, creating a U-shaped or an inverted U-shaped curve. In essence, a quadratic transformation allows a linear regression model to capture more complex, non-linear relationships between variables, improving the model's accuracy and explanatory power without departing from the linear regression framework.

Below you see how different transformations change the line. 

9.3 Additional Learning Materials