Tutorial 6: Basic Correlations

Tutorial Overview:

In the previous tutorials, you learned how to describe your data. Understanding your data is an important step; however, descriptive statistics offer limited insights beyond your immediate dataset. From here on, we will start making inferences about what your data reveals concerning the wider world. Utilising the sample data, we can employ a variety of tests to ascertain if there are correlations within our data and what these correlations indicate about the environment around us.

In this tutorial, we will cover the following concepts and tests:

Grasping these concepts is crucial as they will aid your understanding as we progress through future tutorials.

Additionally, we will explore the differences between parametric and non-parametric data, which is essential for selecting the appropriate statistical tests for your analysis.

6.1 Understanding Probabilities

Probabilities are a fundamental concept in statistics, representing the likelihood of a specific event occurring within a set of possible outcomes. They are expressed as a number between 0 and 1, where 0 indicates an impossibility and 1 signifies certainty. The higher the probability of an event, the more likely it is to occur.

Why We Use Probabilities in Statistics:

Predicting Outcomes: Probabilities help us predict the likelihood of various outcomes. This is crucial in fields like weather forecasting, finance, and sports betting, where predicting future events based on historical data is essential.

Informed Decision Making: By understanding the probabilities of different events, decision-makers can weigh the risks and benefits of various actions more accurately. This applies to healthcare, policy-making, business strategies, and everyday life decisions.

Probabilities allow us to quantify and understand randomness. Many phenomena in nature and human behaviour exhibit randomness, and probabilities help us model and analyze these effectively. In statistics, probabilities are used to make inferences about a population based on a sample. This involves estimating parameters, testing hypotheses, and drawing conclusions about the broader population from a limited set of observations.

In summary, probabilities are a cornerstone of statistical analysis, providing a systematic way to quantify uncertainty and make informed decisions based on data.

6.1.1 Basic Probabilities

While we will explore how probabilities are calculated, you will not have to do this yourself. But you do need to understand the underlying concepts.  So with that said, let's break down the concept of probability into a very simple and understandable explanation. Probability is essentially a way to quantify how likely it is that something will happen. 

Calculate Probability

Imagine you have a regular six-sided die, and you want to calculate the probability of rolling a 5. Since there are six possible outcomes (1, 2, 3, 4, 5, 6) and only one of these outcomes is a 5, the probability of rolling a 5 is 1 out of 6, or approximately 0.167, or 16.7% (0.167*100).

Key Points to Remember

Total Number of Outcomes: This is the total count of all possible results in a given scenario. For a die, it's 6; for a coin, it's 2.

Favourable Outcomes: These are the outcomes you are calculating the probability for. For example, rolling a 4 on a die or flipping heads on a coin.

Probability Formula: The probability of an event happening is calculated using the formula: Probability = Favourable Outcomes/Total Number of outcomes

By applying this basic formula, you can easily calculate the probability of various simple events. Remember, the concept of probability is much broader and can get complex, but understanding these basics is a great starting point.
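The formula above can be sketched in a few lines of Python. This is only an illustration: the function name probability and the variable names are assumptions made for this example, and exact fractions are used so that 1/6 is not rounded.

```python
from fractions import Fraction

def probability(favourable, total):
    """Probability = favourable outcomes / total number of outcomes."""
    return Fraction(favourable, total)

# All possible outcomes of one roll of a six-sided die
die_faces = [1, 2, 3, 4, 5, 6]

# The outcomes we are interested in: rolling a 5
favourable = [face for face in die_faces if face == 5]

p_five = probability(len(favourable), len(die_faces))
print(p_five)                    # exact fraction
print(round(float(p_five), 3))   # decimal form
print(round(float(p_five) * 100, 1), "%")
```

Changing the condition in the list of favourable outcomes (for example, to all even faces) is enough to calculate the probability of a different event.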

6.1.2 Joint Probability

What happens if we have two independent events and we want to know the probability of both occurring, for example rolling the same number twice in a row? Let's go back to the dice. When you roll a die twice, the outcome of each roll is independent of the other, meaning the result of the first roll does not affect the result of the second roll. This scenario introduces the concept of joint probability for independent events.

Joint Probability for Independent Events

The joint probability of two independent events, A and B, happening is the product of their individual probabilities. This is often expressed as:

Probability(A and B) = Probability(A) * Probability(B)

Example: Rolling a Die Twice

Let's consider two events: Event A is rolling a 4 on the first roll, and Event B is rolling a 4 on the second roll.

Since a die has six faces, the probability of rolling a 4 on any given roll is 1/6. Using the rule for independent events, the probability of rolling a 4 on the first roll and again on the second roll is:

P(A (first roll) and B (second roll)) = P(A) * P(B) = 1/6 * 1/6 = 1/36 ≈ 0.0278 ≈ 2.8%

This calculation shows that rolling a 4 twice in a row is quite unlikely.

Another Scenario: Rolling a 4 on the First Roll and Any Even Number on the Second Roll

Let's calculate the probability of this scenario to illustrate how to handle different conditions.

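Here Event A is rolling a 4 on the first roll (probability 1/6), and Event C is rolling an even number (2, 4 or 6) on the second roll (probability 3/6 = 1/2). The joint probability is:

P(A and C) = P(A) * P(C) = 1/6 * 1/2 = 1/12 ≈ 0.0833 ≈ 8.3%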

Key Points to Remember

Independence: The outcome of the second roll does not depend on the outcome of the first roll. Each roll is an independent event.

Joint Probability: For independent events, the joint probability is the product of their individual probabilities.

Understanding how to calculate probabilities with multiple rolls or events is fundamental in statistics and helps in assessing the likelihood of various outcomes in more complex scenarios.
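One way to check the two dice results above is to list all 36 equally likely outcomes of two rolls and count the ones that match each scenario. The sketch below does this; the helper name prob and the variable names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Every (first roll, second roll) pair: 6 * 6 = 36 equally likely outcomes
two_rolls = list(product(faces, repeat=2))

def prob(event):
    """Share of the 36 outcomes for which the event is true."""
    hits = sum(1 for outcome in two_rolls if event(outcome))
    return Fraction(hits, len(two_rolls))

# A 4 on the first roll and a 4 on the second roll
p_four_then_four = prob(lambda o: o[0] == 4 and o[1] == 4)

# A 4 on the first roll and any even number on the second roll
p_four_then_even = prob(lambda o: o[0] == 4 and o[1] % 2 == 0)

# The product rule for independent events gives the same answers
p_first_is_four = prob(lambda o: o[0] == 4)
p_second_is_even = prob(lambda o: o[1] % 2 == 0)

print(p_four_then_four, float(p_four_then_four))
print(p_four_then_even, float(p_four_then_even))
print(p_first_is_four * p_second_is_even == p_four_then_even)
```

Counting outcomes and multiplying the individual probabilities agree here because the two rolls are independent; for dependent events the simple product rule would not apply.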

6.2 Hypothesis Testing & P-Values

6.2.1 Hypothesis Testing

Hypothesis testing is a fundamental aspect of statistical analysis, allowing us to make inferences about our data and, by extension, the world around us. While descriptive statistics provide insights into individual variables, hypothesis testing delves deeper, exploring the relationships and differences between variables to validate or refute assumptions about our data.

Basics of Hypothesis Testing

At its core, a hypothesis is a statement positing a particular relationship or condition, such as "All cars are red." For a hypothesis to be meaningful in a statistical context, it must be testable. This involves setting up two contrasting statements: the Null Hypothesis (H0) and the Alternative Hypothesis (Ha).

Null Hypothesis (H0): This posits the absence of a statistically significant relationship between the variables in question. It represents the default or status quo, serving as the claim to be tested. For instance, it might assert that there is no difference in a particular attribute across groups.

Alternative Hypothesis (Ha): This serves as the counterclaim and suggests that there is a statistically significant relationship between variables. It is what you might believe to be true or hope to prove.

Two other terms that you should learn are directional and non-directional hypotheses. Directional Hypotheses specify the direction of the expected relationship or difference between variables. For Example: 

Non-directional Hypotheses do not specify the direction of the relationship or difference but simply state that a difference or relationship exists. 

Testing Hypotheses

To evaluate these hypotheses, you select an appropriate statistical test based on the nature of your data and hypotheses. The outcome of this test, expressed through p-values and confidence intervals, helps determine whether to reject or fail to reject the Null Hypothesis.

P-values quantify the probability of observing the collected data, or something more extreme, assuming the Null Hypothesis is true. A low p-value provides stronger evidence against the Null Hypothesis, making it more reasonable to reject it in favour of the Alternative Hypothesis.

Confidence Intervals provide a range of values within which the true value of the parameter is expected to lie, with a certain degree of confidence. This can offer additional insight into the relationship between variables.

Important Considerations

It's crucial to remember that while the p-value and your decision on the hypotheses can indicate the presence or absence of a statistically significant relationship, they do not inform about the strength or practical significance of the relationship. For that, you would look into effect size measures and other statistical analyses.

Type 1 Error (False Positive)

A Type 1 error occurs when the Null Hypothesis (H0) is incorrectly rejected when it is actually true. This is akin to a false alarm or a false positive result. In the context of hypothesis testing, it means that you conclude there is a significant effect or difference when, in reality, there isn't one.

Example: 

If your Null Hypothesis states there is no difference in income between men and women and the Alternative Hypothesis claims the opposite, a Type 1 error would mean concluding there is a difference when actually there is none.

Type 2 Error (False Negative)

A Type 2 error happens when the Null Hypothesis is not rejected even though the Alternative Hypothesis (Ha) is actually true. This error is referred to as a false negative. It means missing out on identifying a genuine effect or difference.

Example: 

If your Null Hypothesis states there is no difference in income between men and women and the Alternative Hypothesis claims the opposite, a Type 2 error would mean concluding there is no difference when actually there is one.

Significance Levels (Alpha)

This is a threshold set by the researcher to decide how much risk of a Type 1 error is acceptable. It is often referred to as alpha, and it is the value against which the p-value is compared. In the Social Sciences, it is commonly set at 0.05. This indicates that the researcher accepts a 0.05 (or 5%) chance of falsely rejecting the Null Hypothesis. You can of course set lower levels such as 0.01, or even 0.001.

Balancing Type 1 and Type 2 Errors

In statistical analysis, there's often a trade-off between minimizing Type 1 and Type 2 errors. Reducing the risk of one can increase the risk of the other. The challenge for researchers is to design their studies and choose significance levels that balance these risks appropriately, taking into account the context and consequences of potential errors.

Key Takeaways

Type 1 Error: Rejecting a true Null Hypothesis (false positive).

Type 2 Error: Failing to reject a false Null Hypothesis (false negative).

You must balance the risks of Type 1 and Type 2 errors, often by setting an appropriate significance level and designing studies with sufficient power to detect true effects.
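The trade-off between the two error types can be illustrated with a small worked example. The sketch below is not part of the course material: it assumes a hypothetical experiment of 100 coin flips, where H0 says the coin is fair and, for the Type 2 calculation, the coin is assumed to land heads 60% of the time. All function names are made up for this illustration, and the test used is an exact two-sided binomial test.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k, n):
    """Exact p-value for k heads in n flips under H0: the coin is fair."""
    observed = binom_pmf(k, n, 0.5)
    # Add up every outcome that is at least as unlikely as the observed one.
    return sum(
        prob for prob in (binom_pmf(i, n, 0.5) for i in range(n + 1))
        if prob <= observed * (1 + 1e-9)
    )

def error_rates(n, alpha, true_p):
    """Return (Type 1 rate, Type 2 rate) for the test at level alpha."""
    # Outcomes that would lead us to reject H0
    reject = [k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha]
    type_1 = sum(binom_pmf(k, n, 0.5) for k in reject)         # H0 is true
    type_2 = 1 - sum(binom_pmf(k, n, true_p) for k in reject)  # Ha is true
    return type_1, type_2

results = {}
for alpha in (0.05, 0.01):
    results[alpha] = error_rates(n=100, alpha=alpha, true_p=0.6)
    type_1, type_2 = results[alpha]
    print(f"alpha={alpha}: Type 1 rate={type_1:.3f}, Type 2 rate={type_2:.3f}")
```

Under these assumptions, lowering alpha from 0.05 to 0.01 reduces the Type 1 rate but increases the Type 2 rate, which is the trade-off described above. Increasing the number of flips (the sample size) is one way to reduce the Type 2 rate without raising alpha.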

6.2.2 Understanding P-Values

Understanding probabilities is crucial to grasping the concept of p-values, which are a fundamental aspect of statistical hypothesis testing. A p-value helps you determine the significance of your results in a statistical hypothesis test.

What is a P-value?

Inferential statistics frequently involve calculating p-values, which are crucial in hypothesis testing. A p-value is a number between 0 and 1 that indicates the probability of observing the collected data (or something more extreme) under the assumption that the null hypothesis is true. It helps in deciding whether to reject or fail to reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis, making us more inclined to reject it.

Zero (0) indicates an event is impossible under the null hypothesis.

One (1) indicates certainty that the event will occur under the null hypothesis.

The closer the p-value is to 0, the stronger the evidence against the null hypothesis. In the social sciences, a p-value of 0.05 or lower is typically sought. This threshold means that, if the null hypothesis were true, results at least as extreme as those observed would be expected no more than 5% of the time. It does not mean there is a 95% probability that the results are real rather than due to chance. This 0.05 threshold, however, is somewhat arbitrary and can be adjusted based on the acceptable risk level of the research. For instance, a threshold of 0.1 implies a 10% risk of a false positive is acceptable.

It's important to note that the choice of significance level (e.g., 0.05) depends on the context of the research and the researcher's willingness to accept the risk of a Type 1 error (falsely rejecting the null hypothesis). While 0.05 is standard, especially in social sciences, different fields may choose different thresholds based on the norms and consequences of errors in those fields.


Interpretation

Low P-value (< 0.05): A p-value below the threshold of 0.05 is typically considered statistically significant. This indicates that the probability of observing the data (or more extreme) assuming the null hypothesis is true is less than 5%. It suggests that such an extreme result is unlikely to occur by chance alone, leading to the rejection of the null hypothesis. In practical terms, you conclude there is statistically significant evidence to support the existence of a difference or effect.

High P-value (≥ 0.05): A p-value at or above 0.05 suggests that the observed data is not sufficiently unlikely under the null hypothesis. This means there isn't strong statistical evidence to reject the null hypothesis. Therefore, you might conclude that the data does not provide enough evidence to support a difference or effect. However, it's crucial to note that this does not prove the null hypothesis is true; rather, it indicates that the data does not provide strong evidence against it.
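To make the interpretation above concrete, here is a minimal sketch that calculates a p-value by hand for a hypothetical experiment: a coin is flipped 100 times and lands heads 60 times, and H0 states that the coin is fair. The scenario and the function name are assumptions made for this example; in practice your statistical software reports the p-value for you.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact p-value under H0 (fair coin): the probability of a result
    at least as unlikely as the one observed."""
    probs = [comb(flips, k) * 0.5**flips for k in range(flips + 1)]
    observed = probs[heads]
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

p_value = two_sided_p_value(heads=60, flips=100)
print(f"p = {p_value:.3f}")

# The same p-value can lead to different decisions under different thresholds
for alpha in (0.05, 0.10):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"threshold {alpha}: {decision}")
```

With these hypothetical numbers the p-value falls between 0.05 and 0.10, so the decision depends on the threshold chosen before the analysis. Failing to reject H0 at 0.05 does not show that the coin is fair; it only shows that 60 heads in 100 flips is not strong enough evidence against fairness at that threshold.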

Important Considerations

Context Matters: The choice of a 0.05 threshold is conventional but arbitrary. Some fields or specific studies may require more stringent criteria (e.g., 0.01) or might accept a less strict threshold (e.g., 0.1), depending on the context, the stakes of making errors, and the norms of the field.

Effect Size and Practical Significance: While the p-value can indicate whether an observed effect is statistically significant, it does not convey the size or practical importance of the effect. Additional measures, such as effect size, are necessary to understand the magnitude and relevance of the findings.

Not Proof of No Effect: A high p-value does not prove that there is no effect or difference. It simply means that the evidence is not strong enough to reject the null hypothesis given the data. Other factors, such as sample size and study power, play critical roles in interpreting p-values.

How to report p-values in your written work:

When reporting p-values in written work, clarity and precision are key. Here are concise guidelines on how to do so:

Example of reporting:

"In analyzing the data, we found a statistically significant difference in the extremism score of men and women (p = 0.032). This suggests ... "

By following these guidelines, you ensure that your reporting of p-values contributes meaningfully to the reader's understanding of your research findings. If you report a series of significant and non-significant p-values in a table, consider making the significant p-values bold or putting an asterisk (*) before or after them to make them stand out.

A significant p-value in hypothesis testing indicates that the observed difference or effect between groups would be unlikely if the null hypothesis were true. However, this significance does not tell you how large the difference is (it can be tiny or huge) unless the test specifically assesses the relationship's strength and direction. Therefore, while a low p-value provides statistical evidence against the null hypothesis, it does not by itself show that the difference is meaningful in practice, nor does it automatically mean that there is a correlation. Correlation and significance are distinct concepts, each with specific interpretations based on the context of the test conducted.

6.3 Parametric v Non-Parametric Tests

Remember the different types of data we discussed earlier? The type of analysis you can conduct depends on the nature of your data. Broadly, analyses fall into two categories: those suitable for parametric data and those for non-parametric data. The characteristics of your data will dictate which type of statistical test you should use. It's crucial to verify that your data meets the assumptions required by the selected test. If your data doesn't align with these assumptions, you should choose a more suitable test. In certain situations, you might need to transform your data to meet these assumptions. This can involve applying linear or non-linear transformations, or converting your data into ordinal or categorical variables, to enable the appropriate statistical analysis.

In this section you will learn about when to use several different tests - these tests will be covered in more detail in later sessions.  

Why Does This Difference Matter?

Choosing between parametric and non-parametric methods affects how you analyze data and what conclusions you can draw. Parametric methods can be more powerful and precise, giving you stronger insights when their assumptions are met. Non-parametric methods are more versatile and robust, allowing you to analyze data without strict assumptions. Your choice impacts the effectiveness of your statistical analysis, potentially affecting decisions in science, business, and many other fields. It's crucial to match the method to your data's nature and the assumptions you can confidently make, ensuring your conclusions are valid and reliable.

6.3.1 Parametric Data

Parametric data refers to datasets that assume the underlying data distribution is known and usually follows a normal distribution (bell curve). This assumption about the distribution allows for the application of parametric statistical tests, which can offer more powerful and precise insights when the assumptions are correctly met. The key characteristics of parametric data include:

Because parametric tests rely on these assumptions, they are best used when you have a large sample (according to the Central Limit Theorem, the sampling distribution of the mean approaches a normal distribution as sample size increases, even when the raw data are not normally distributed) or when you have prior evidence that your data follow a normal distribution.
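The Central Limit Theorem can be illustrated with a short simulation. The sketch below is an illustration under stated assumptions, not a formal test of normality: it draws data from a strongly right-skewed (exponential) distribution, which is a choice made for this example, and compares the skewness of the raw values with the skewness of the means of many samples of size 50. The skewness helper is written out by hand and is not taken from a library.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def skewness(values):
    """Simple skewness measure: 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Raw data from a strongly right-skewed distribution (exponential, mean 1)
raw = [random.expovariate(1.0) for _ in range(5000)]

# Means of 2000 separate samples, each containing 50 observations
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(50)])
    for _ in range(2000)
]

skew_raw = skewness(raw)
skew_means = skewness(sample_means)
print(f"skewness of raw data:     {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The raw data remain clearly skewed, while the sample means are much closer to symmetric. This is why a large sample can justify a parametric test on the mean even when individual observations are not normally distributed.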

Types of Parametric Tests

Here are some of the most common parametric tests you might use:

T-test and ANOVAs have been placed last here, as they are special applications of Linear Regression. In this course,  you will learn about Linear Regression first, as this will give you a really good understanding of the other tests.  

Importance of Meeting Assumptions

Note: Most statistical programmes allow you to test for assumptions before you run the tests. Take the time to do so.  

The accuracy of parametric tests hinges on the validity of their underlying assumptions. If these assumptions are violated, the results may not be reliable. For instance, if your data significantly deviates from normality, using a parametric test could lead to incorrect conclusions. In such cases, transforming the data or opting for non-parametric alternatives might be necessary.

Parametric tests are powerful tools in statistical analysis, offering precise estimates and inferences about the population when their assumptions are met. They are particularly valuable in experimental designs and quantitative research where the assumptions about the population distribution can be justified.

6.3.2 Pearson's Correlation Coefficient versus Linear Regression

Pearson correlation and linear regression are both statistical methods used to analyze the relationship between two continuous variables, but they serve different purposes and provide different insights:

Pearson Correlation Coefficient (r):

This test measures the strength and direction of the linear relationship between two variables. Its output is a single value ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

Pearson's r focuses on the correlation or degree of association between the variables, without implying causation. It doesn't differentiate between dependent and independent variables; both variables are treated equally.

Use it when you want to quantify how strongly two variables are related in a linear fashion without making predictions.

Linear Regression:

Linear regression predicts the value of a dependent variable based on the value of one or more independent variables, assuming a linear relationship between them. Its output is an equation of the form y = mx + b for simple linear regression (one independent variable), or a more complex equation for multiple linear regression (multiple independent variables). The equation includes a coefficient for each predictor, indicating how much the dependent variable is expected to change when that predictor increases by one unit, holding all other predictors constant.

Linear regression identifies the best-fitting line through the data points and can be used for prediction. It explicitly distinguishes between independent (predictors) and dependent (outcome) variables. Use it when you want to predict the value of one variable based on the values of one or more other variables and understand the nature of their relationship.

Key Differences:

Pearson correlation quantifies the degree of a linear relationship, while linear regression models the relationship to predict outcomes. Pearson treats both variables symmetrically as equals, whereas linear regression identifies one variable as dependent and one or more others as independent.

Only linear regression provides a predictive model that can be used to estimate the dependent variable's values based on the independent variables.

In summary, Pearson correlation is best for understanding the strength and direction of a linear relationship between two variables, while linear regression is suited for predicting the value of a dependent variable based on one or more independent variables.
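The difference between the two methods can be seen by computing both on the same data. The sketch below uses a small made-up dataset (hours and score are hypothetical variables invented for this example) and writes out the standard formulas by hand rather than relying on a statistics package.

```python
from math import sqrt

# Hypothetical data: hours of study (x) and test score (y)
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 71]

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Strength and direction of the linear relationship (-1 to 1)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope (m) and intercept (b) for y = mx + b."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

r = pearson_r(hours, score)
slope, intercept = fit_line(hours, score)

# Correlation is symmetric: swapping the variables gives the same r
print(f"r(hours, score) = {r:.3f}, r(score, hours) = {pearson_r(score, hours):.3f}")

# Regression is not symmetric, and it yields an equation we can predict with
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 7 hours: {slope * 7 + intercept:.1f}")
```

Pearson's r returns one number describing the association, whereas the regression returns a slope and intercept that can be used to estimate the outcome for a new value of the predictor.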

6.3.3 Non-Parametric Data

Non-parametric data refers to data that does not assume an underlying distribution. This can be due to various reasons such as:

Non-parametric methods are more flexible and don’t require the data to meet strict assumptions about the population parameters. This makes them very useful in dealing with real-world data that doesn’t neatly fit the ideal conditions for parametric tests.

Types of Non-Parametric Tests (Ordinal Data):

Types of Non-Parametric Tests (Nominal or Nominal v Ordinal):

Why Non-Parametric Data and Tests Matter:

Non-parametric tests are crucial in statistical analysis because they provide the means to analyze data without the strict assumptions required by parametric tests. This flexibility allows researchers to make meaningful inferences from a wide range of data types and distributions, enhancing the robustness and applicability of statistical methods across various fields of study. By using non-parametric tests, analysts can ensure their findings are valid even when the data do not meet the stringent requirements of parametric methods, making their conclusions more reliable and widely applicable.

If you want to watch a really nice introduction on this topic and understand which test to use watch this video

6.4 Confidence Intervals, Effect Sizes & Correlations

6.4.1 Confidence Intervals

Let's assume you are interested in the average extremism score of the UK population. It is impractical to measure every single person's extremism score, so instead you measure the extremism score of a sample population. 

Based on this sample you can calculate the average extremism score. Let's assume you took a different sample, chances are you would get a different average score. So how can we be confident that our results are close to the true population average extremism score? 

This is where confidence intervals come into play!

Understanding Confidence Intervals

A confidence interval gives you a range of values within which the true average extremism score is likely to fall. It's like saying, "We are 95% confident that the true average extremism score of the population is between x and y."

Here's a simple breakdown of how they work (in reality your statistical software will calculate them for you):

Choose Your Confidence Level: This is usually 95%, but it can be 90%, 99%, or any other value depending on how confident you want to be. A 95% confidence level means that if you were to take 100 different samples and compute a confidence interval from each of them, about 95 of those intervals would contain the true average extremism score.

Calculate the Sample Mean (Average): This is the average extremism score from your sample.

Calculate the Margin of Error: This involves some statistics, including the standard deviation (how spread out the extremism score is) and the sample size. The larger your sample, the smaller your margin of error will be.

Create the Interval: You take the sample mean and add and subtract the margin of error to it. This gives you the lower and upper bounds of your confidence interval.

So, if your sample mean extremism score is 0.87 and the 95% confidence interval runs from 0.80 to 0.93, you can be 95% confident that the mean extremism score of the population lies within that range. You can, of course, increase your confidence by selecting a higher confidence level, e.g. 99%, but this makes the interval wider.

In essence, confidence intervals provide a way to quantify the uncertainty of your estimate. They don't guarantee that the true average falls within the interval, but they give you a range that is likely to contain the true average based on your sample data and the level of confidence you've chosen. To watch a really easy explanation of how, when and why we use them watch this video here
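The four steps described above can be sketched as follows. This is a simplified illustration: the extremism scores are made-up values, the function name is an assumption, and the margin of error uses the normal approximation (a z-value of about 1.96 for 95%). Statistical software typically uses the t-distribution instead, which gives slightly wider intervals for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean, using the normal approximation."""
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # z-value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical extremism scores from a sample of 20 respondents
scores = [0.91, 0.72, 0.85, 1.02, 0.78, 0.95, 0.88, 0.69, 0.99, 0.83,
          0.90, 0.76, 0.94, 0.81, 0.87, 1.05, 0.74, 0.92, 0.86, 0.89]

low_95, high_95 = mean_confidence_interval(scores, confidence=0.95)
low_99, high_99 = mean_confidence_interval(scores, confidence=0.99)

print(f"sample mean: {mean(scores):.3f}")
print(f"95% CI: {low_95:.3f} to {high_95:.3f}")
print(f"99% CI: {low_99:.3f} to {high_99:.3f}")
```

The 99% interval is wider than the 95% interval: being more confident that the range contains the true mean comes at the cost of a less precise range. A larger sample would shrink the standard error and narrow both intervals.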

Differences between Standard Error and Confidence Intervals

It is worth pointing out the difference between the Standard Error (covered earlier) and Confidence Intervals. The standard error (SE) and confidence intervals (CIs) are related but distinct concepts in statistics. The standard error measures the variability or precision of a sample statistic (such as the sample mean) in estimating a population parameter, reflecting how much the statistic would vary across different samples. On the other hand, a confidence interval provides a range of values, based on the sample statistic and its standard error, within which we are a certain percentage confident (e.g., 95%) that the true population parameter lies. While the standard error quantifies the uncertainty of the sample mean as a single estimate of the population mean, the confidence interval expands on this by offering a range that is likely to capture the true population mean, incorporating both the estimate's uncertainty and the desired level of confidence.

6.4.2 Effect Size and Correlation Coefficients

Effect size and correlation coefficients are both measures used in statistics to quantify relationships between variables, but they serve different purposes and convey different types of information. Note, these only become important if your p-value has reached your set threshold, e.g., p < 0.05:

Effect Size

Effect size measures the magnitude of a relationship or the strength of an effect in a population, without being influenced by the sample size. It provides a quantitative description of the difference between groups or the size of a relationship within a study, making it easier to understand the practical significance of research findings. Common examples include Cohen's d, which quantifies the difference between two means relative to the standard deviation, and eta squared, which measures the proportion of variance explained in the dependent variable by the independent variable.

Tests such as the t-test and the non-parametric tests will have effect sizes available to you.
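As an illustration of one of the effect sizes mentioned above, the sketch below calculates Cohen's d for two groups using the pooled standard deviation. The two groups of scores are made-up values and the function name is an assumption for this example; your statistical software will normally report this for you.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference between two means, expressed in pooled standard deviations."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

# Hypothetical extremism scores for two groups of respondents
group_a = [0.91, 0.85, 1.02, 0.95, 0.88, 0.99, 0.90, 0.94]
group_b = [0.72, 0.78, 0.69, 0.83, 0.76, 0.81, 0.74, 0.86]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in standard deviation units rather than in the original scale, it describes how large the difference is regardless of the sample size, which is the information a p-value alone does not provide.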

Correlation Coefficients:

Note: These are used to measure the strength of a linear relationship - so use them for continuous and ordinal data

Correlation coefficients quantify the strength and direction of a relationship between two variables. They range from -1 to 1, where 1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates no relationship. The most common correlation coefficient is Pearson's r, which measures the degree to which two continuous variables are linearly related. For ordinal data, use Spearman's rho or Kendall's tau-b, which are based on ranks and measure monotonic rather than strictly linear association.

In summary, effect size is broader and can describe the magnitude of any type of effect, not just linear relationships, making it useful for understanding the practical importance of results. Correlation coefficients specifically measure the linear relationship between two variables, providing insight into how closely two variables change together.
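To show how a rank-based coefficient differs from Pearson's r, the sketch below computes Spearman's coefficient as Pearson's r applied to the ranks of the data, with tied values given their average rank. The data are made up for this example (y is simply x squared), and the function names are assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with a perfectly increasing but curved relationship
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson's r = {r:.3f}")
print(f"Spearman's rho = {rho:.3f}")
```

Because y always increases when x increases, the ranks match perfectly and Spearman's coefficient is 1, while Pearson's r is slightly below 1 because the relationship is not a straight line.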

For an in-depth introduction to Pearson's Correlation test watch this video.

6.4.3 Contingency Tables, Chi-Square Tests and Measures of Association

If you want to compare nominal data, or ordinal data using various levels, you would use a contingency table. We covered how to create a contingency table and how to interpret the data in the previous tutorial. Here we will understand in more detail how we can use this test for hypothesis testing.

Let's say we want to explore the relationship between Extreme Yes/No and agreement with the statement Violence is effective in creating a new Society. To figure this out, we use a contingency table where one side lists whether someone is extreme or not and the other side lists how far they agreed with the statement. This data can be presented as counts and/or percentages. The table helps you organize the counts of different combinations of categories (e.g., Extreme and Strongly Disagree, Extreme and Somewhat Agree, etc.).

Looking at the counts and/or percentages can give you some idea about the relationship. Using a Chi-square test coupled with measures of association allows us to test our hypothesis and measure how strong the relationship is. This test looks at the numbers in your table and calculates whether the differences you see are just due to chance, or if they're likely reflecting a real pattern or relationship. In simple terms, the Chi-Square test helps you figure out if the observations you made are likely true for the broader population or if they just happened by accident in your survey. 

For nominal data, where categories do not have an inherent order, several measures of association are used to analyze relationships between variables. Here are some common ones:

Interpreting the output

The Chi-Square Test of Independence tests whether there is a significant association between two nominal/ordinal variables (if both are ordinal, consider using Spearman's test). You can also add levels, but avoid having too many. The p-value will tell you whether you should reject or fail to reject your Null Hypothesis.

You can measure the strength of the association between the variables using the following tests (more are available).  

These measures allow researchers to understand and quantify the strength and direction of associations between nominal variables, providing insights into the relationships in categorical data.

Let's use what we know and apply it to the above example using the Extreme Yes/No and Violence is effective in creating a new Society variables. Here is the Hypothesis we will test:

H0: There is no difference between being extreme or not and their attitudes to whether Violence is an effective way to achieve a new society. 

Ha: There is a difference between being extreme or not and their attitudes to whether Violence is an effective way to achieve a new society. 

Based on the p-value returned (<.001) we can reject H0 in favour of Ha. We can state that there is a statistically significant association between being extreme or not and a person's attitudes towards whether or not violence is an effective way to achieve a new society.

The Cramer's V (0.21) indicates that there is a strong association between being extreme or not and attitudes towards violence being an effective means to achieve a new society.
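The calculations behind a Chi-Square test and Cramer's V can be sketched as below. The counts are invented for illustration and are not the tutorial's survey data; to keep the example short, attitudes are collapsed into a hypothetical 2 x 2 table (agree or disagree). The p-value line uses a shortcut that is only valid for a 2 x 2 table (1 degree of freedom), and no continuity correction is applied. Function names are assumptions for this example.

```python
from math import erfc, sqrt

# Hypothetical counts (rows: Extreme Yes / No, columns: Agree / Disagree)
table = [
    [30, 70],    # Extreme: Yes
    [90, 810],   # Extreme: No
]

def chi_square(table):
    """Chi-Square statistic: compares observed counts with the counts
    expected if the two variables were unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n

def cramers_v(table):
    """Strength of association, from 0 (none) to 1 (perfect)."""
    statistic, n = chi_square(table)
    smaller_dimension = min(len(table), len(table[0])) - 1
    return sqrt(statistic / (n * smaller_dimension))

statistic, n = chi_square(table)
# Shortcut for 1 degree of freedom only (a 2 x 2 table)
p_value = erfc(sqrt(statistic / 2))
v = cramers_v(table)

print(f"Chi-Square = {statistic:.2f}, p = {p_value:.2g}")
print(f"Cramer's V = {v:.2f}")
```

The Chi-Square statistic and its p-value address whether an association exists, while Cramer's V describes how strong it is. With a large sample, even a modest Cramer's V can come with a very small p-value, which is why both are reported.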

Interpreting Cramer's V

Interpreting Contingency Coefficient

6.5 General Reading Materials