Tutorial 5:
Reducing your Data
Overview
In Tutorial 1 you learned about transforming and computing variables into new variables. These are simple changes, e.g. creating a new variable Age_Cat (18-34; 35-44; etc.) based on the continuous variable Age.
In this tutorial we will take this a step further. Here you will learn how to reduce multiple variables by grouping them together into a smaller set of variables called 'Components' or 'Factors'. We will cover the following three methods:
Reliability Analysis
Principal Component Analysis
Exploratory Factor Analysis
Additional Learning Materials
Easy: Cole, C. (2019) Statistical Testing with Jamovi and JASP: Open Source Software. Vor Press. Read: Chapter 13.
Moderate: Navarro, D. & Foxcroft, D. (2022) Learning Statistics with Jamovi. Read: Chapter 15.
Variables Required
Reliability/Principal Component/Exploratory Factor Analysis
For each of these you should use a series of continuous or ordinal variables.
It is important to note that each of the variables selected should use the same measurement scale.
For the examples we will be using the following dataset:
Skoczylis, Joshua, 2021, "Extremism, Life Experiences and the Internet", https://doi.org/10.7910/DVN/ICTI8T, Harvard Dataverse, Version 3.
Concepts: Reducing Your Data
When you have a large dataset these methods come in handy, as they allow you to reduce some of your variables into fewer 'Components' and/or 'Factors' - essentially new variables. In this tutorial we will be using the Extremism dataset to create a series of new variables using these techniques.
Don't worry, the maths behind this is done by the computer. Most stats programmes make it very easy to save the components/factors as new variables.
These new variables can then be used in further analysis of your data.
Reliability Analysis
Let's say you have attempted to measure a concept by asking participants a number of questions on the topic. You now have a series of questions that measure the concept. This is great, but it would be much easier if you had one continuous variable that measures the concept.
Reliability Analysis is the simplest way of reducing your data. Before you combine the variables into one, the Reliability Analysis will check whether there is a statistically reliable relationship between the individual variables you want to combine.
Reliability is measured using Cronbach's Alpha. This measure tells you how reliable your desired combination of variables is. Usually, if you get a Cronbach's Alpha value below 0.6 you should remove the variable from your analysis and the future scale.
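If you are curious about what Jamovi is doing under the bonnet, here is a minimal Python sketch of Cronbach's Alpha. The file name (extremism.csv) and item names (Q1-Q4) are hypothetical and only there for illustration.

    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's Alpha for a set of scale items (rows = respondents)."""
        k = items.shape[1]                          # number of items
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical example: four Likert-type items from the survey
    df = pd.read_csv("extremism.csv")
    print(cronbach_alpha(df[["Q1", "Q2", "Q3", "Q4"]]))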
Once you are happy with the results, Jamovi will save the new variable using either the Mean or Sum score. You should also consider standardising this variable (using Transform) into z-scores.
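As a rough sketch of the Mean score and z-score steps, using the same hypothetical items as above:

    # Save the mean score as a new variable, then standardise it as a z-score
    scale_items = df[["Q1", "Q2", "Q3", "Q4"]]
    df["Scale_Mean"] = scale_items.mean(axis=1)
    df["Scale_Z"] = (df["Scale_Mean"] - df["Scale_Mean"].mean()) / df["Scale_Mean"].std(ddof=1)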
You can now use this score in future analyses of your data. The video below provides a short introduction.
Principal Component Analysis (PCA) v Exploratory Factor Analysis (EFA)
The idea behind PCA and EFA is similar - you are looking to reduce your data and combine a series of variables into one or more new variables, also called factors. The maths behind the two is different, though.
In brief, if you have a theory or theoretical framework that explains your new factors, always opt for EFA. In the social sciences this would usually be your standard go-to test for reducing your data. For example, in the extremism dataset we specifically asked a number of questions to gauge people's attitudes towards extremism, so an EFA would be appropriate.
If you just want to check whether a series of variables can be combined, but you have no theory, then you might consider using a PCA instead.
If you want to find out more about how a PCA works, watch this video.
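If you prefer code to video, the sketch below shows the basic idea of a PCA using scikit-learn in Python. It is only an illustration; the variable names are again hypothetical, and Jamovi does all of this for you.

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # df and the hypothetical items Q1-Q4 as in the reliability sketch above
    X = df[["Q1", "Q2", "Q3", "Q4"]]
    X_std = StandardScaler().fit_transform(X)    # PCA is usually run on standardised data

    pca = PCA(n_components=2)                    # keep the first two components
    scores = pca.fit_transform(X_std)            # component scores, one row per respondent
    print(pca.explained_variance_ratio_)         # share of variance each component explains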
EFA and PCAs have a number of steps you need to go through before you can save your factors as a new variable. Below we will take you through each of these steps.
Extraction (not applicable to PCA):
Extraction is the process of obtaining factor loadings (correlation coefficients between the observed variables and the latent common factors). The maths is done by the computer; all you have to do is choose the method from the drop-down box.
We recommend selecting either Principal Axis or Maximum Likelihood factoring. The former is probably your best choice.
Rotation:
There are two types of rotation methods: Oblique and Orthogonal. Oblique rotations assume that there is a correlation between the factors. For most social science subjects this is the most appropriate method of rotation. There are a number of available oblique rotations; which one you select doesn't really matter (they are different ways of calculating a similar result). These are:
Promax
Oblimin
Simplimax
Orthogonal rotations assume that your factors are not correlated. Again, which one you select doesn't really matter (a short code sketch of these choices follows the list below). Your options are:
Quartimax
Varimax
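For those working outside Jamovi, here is a rough sketch of how the extraction method and rotation can be chosen using the factor_analyzer package in Python. The item names and the number of factors are assumptions for illustration only.

    from factor_analyzer import FactorAnalyzer

    # Maximum Likelihood extraction with an oblique (oblimin) rotation
    fa = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")
    fa.fit(df[["Q1", "Q2", "Q3", "Q4"]])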
Assumptions:
Surprise surprise, there are some assumptions. We won't bore you with the details, but you will have to run Bartlett's test of sphericity. If you get a result of <.05 your assumption is met (note this is the opposite of the t-test and ANOVA).
You also need to check the KMO Measure of Sampling Adequacy (MSA). Essentially, you get a list of each of the variables you want to include in your analysis, and for each you get an MSA score. If the score is below 0.6 then you should remove the variable from your analysis.
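The same two checks can be sketched in Python with the factor_analyzer package, again using the hypothetical items from earlier:

    from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

    items = df[["Q1", "Q2", "Q3", "Q4"]]
    chi_square, p_value = calculate_bartlett_sphericity(items)   # assumption met if p < .05
    kmo_per_item, kmo_total = calculate_kmo(items)               # drop items with MSA below 0.6
    print(p_value, kmo_per_item, kmo_total)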
As you can see from the tables below, the assumption is met (p-value of <.001). Based on the MSA scores we need to remove Place_World (0.508) and Understand_Personality (0.524).
If you have a small dataset (fewer than a few hundred cases), use Parallel Analysis to decide on the number of factors.
In most cases you should use Eigenvalues to decide how many factors to create. Each factor with an Eigenvalue of more than one should be included. You can check how many factors have an Eigenvalue above 1 in the Scree Plot and the Eigenvalue table.
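If you want to double-check the Eigenvalue rule by hand, a minimal numpy sketch (assuming the same hypothetical items as above) looks like this:

    import numpy as np

    # items as defined in the sketch above
    corr = np.corrcoef(items, rowvar=False)        # correlation matrix of the items
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
    n_factors = (eigenvalues > 1).sum()            # keep factors with an Eigenvalue above 1
    print(eigenvalues, n_factors)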
You can also check whether a pre-determined number of factors works.
Finally, you can also look at your Factors. You will get a table that lists all of the new factors and how each of the variables maps onto them. The 'Uniqueness' column on the right shows how much of each variable's variance is unique to itself and thus not shared with the other variables. A high uniqueness score means the variable is of more limited relevance to the factors.
The other numbers reported are the factor loadings. The higher the loading, the better. Factor loadings are used to calculate individuals' factor scores.
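Sticking with the (assumed) factor_analyzer sketch from earlier, the loadings, uniqueness values and factor scores can be inspected like this:

    # fa and items as defined in the sketches above
    print(fa.loadings_)              # loadings: how strongly each variable maps onto each factor
    print(fa.get_uniquenesses())     # uniqueness: variance not shared with the other variables
    factor_scores = fa.transform(items)  # individual factor scores, one row per respondent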
Based on this you should now be able to reduce your data using some common techniques.