Tutorial 7: Reducing Data

Overview: 

One of the first things you learnt was transforming data. In this tutorial, we take data transformation a step further: you will learn how to reduce multiple variables to a smaller set of variables called 'Components' or 'Factors'. We will cover the following three methods:

- Reliability Analysis combined with sum or mean scores
- Principal Component Analysis (PCA)
- Exploratory Factor Analysis (EFA)

Once you have created new factors, you can use them as part of your wider analysis of the data. 

7.1 Ordinal v Continuous

7.1.1 Treating Ordinal Variables as Continuous

Using ordinal variables in Reliability Analysis (RA), Principal Component Analysis (PCA), and Exploratory Factor Analysis (EFA) is a topic of discussion among researchers. Ordinal variables, by definition, represent categories with a meaningful order but without a consistent interval between them. Underlying these tests, a correlation matrix is used to work out the correlations between variables. When dealing with ordinal data, you should normally use non-parametric alternatives within these tests (e.g., Spearman's rank correlation or polychoric and polyserial correlations). These options are not always available in standard statistical programs. Below we outline when you can use ordinal variables and treat them as continuous as part of your analysis.

Can You Treat Ordinal Variables as Continuous?

Short Answer: Yes, but with caution. It's common practice, especially when the ordinal scale has many levels (usually 5 or more), to treat ordinal variables as continuous in EFA. This approach assumes that the intervals between the categories are approximately equal, which may not always be accurate but is often considered a reasonable approximation for practical purposes.

Considerations:

While it's not uncommon to treat ordinal variables as continuous in Exploratory Factor Analysis, it's important to be mindful of the limitations and potential inaccuracies this approach might introduce. Consider the nature of your data, the number of response categories, and the purpose of your analysis when making this decision. Alternative methods that respect the ordinal nature of the data may offer more precise insights but might also require more complex analyses.

7.2 Reliability Analysis, Sum & Mean Scores

7.2.1 Reliability Analysis

What is Reliability Analysis (RA)?

In simple terms, reliability analysis, particularly using Cronbach's Alpha, helps you check how trustworthy your survey or test is. You use it to ensure that the questions in your survey consistently measure the intended concept. The closer Cronbach's Alpha is to 1, the more reliable your measurement tool is. However, aim for a balance, because extremely high values might suggest redundancy among the questions. You can use reliability analysis to check whether the questions you want to combine are reliably measuring similar things. Note that the questions you want to combine should all use the same scale.

When to use RA?

Use a reliability analysis when you:

- want to check whether a set of questions consistently measures the same underlying concept
- plan to combine several items into a single scale, for example through a sum or mean score

How RA works:

Reliability analysis uses something called Cronbach's Alpha. This measure looks at the internal consistency of a survey or test. Internal consistency refers to how closely related a set of items are as a group. It's like checking if every choir member sings in harmony with the others to produce a coherent sound.

Cronbach's Alpha calculates the average correlation between all items in your test or survey that are designed to measure the same concept. When you want to combine data, you will use this on the variables you are interested in combining. 

Briefly, when you run an RA it will do the following:

- compute the correlations between all pairs of items you want to combine
- calculate Cronbach's Alpha from those correlations
- show how the Alpha would change if individual items were removed, helping you spot items that do not fit

Understanding Cronbach's Alpha

In essence, Cronbach's Alpha provides a single number that indicates the reliability of a scale, based on the interrelationships among the items within the scale. Here's what happens under the hood:
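Alpha is calculated as k/(k − 1) × (1 − sum of the item variances / variance of the summed scale), where k is the number of items. Here is a minimal sketch of that calculation in Python; the function name, DataFrame, and column names are hypothetical:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's Alpha for items that share the same response scale."""
    k = items.shape[1]                          # number of items
    item_variances = items.var(axis=0, ddof=1)  # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with hypothetical column names:
# alpha = cronbach_alpha(df[["violence_q1", "violence_q2", "violence_q3"]])
```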

Interpretation Revisited with Insight

A higher Alpha indicates that the items have a high average correlation, suggesting good internal consistency: your test or survey reliably measures a concept. A lower Alpha suggests that the items do not correlate well with each other on average, indicating poor internal consistency. Your survey might be measuring multiple concepts, or some items may not fit well.

Cronbach's Alpha is a number between 0 and 1 that you get from performing reliability analysis. Here's how to interpret it (these are common rules of thumb rather than hard cut-offs):

- 0.9 and above: excellent, though values this high may signal redundant items
- 0.8 to 0.9: good
- 0.7 to 0.8: acceptable
- 0.6 to 0.7: questionable
- below 0.6: poor

Summary

Cronbach's Alpha is a powerful tool in reliability analysis for assessing the internal consistency of a survey or test. By calculating the average correlation between all pairs of items, the test provides a clear indication of whether your variables together reliably measure the intended concept. Remember, while a higher Alpha value is generally desirable, extremely high values should prompt a review of item diversity and survey length to avoid redundancy and ensure respondent engagement.

Above you can see the output of a reliability analysis. Here we are checking whether questions on violence reliably measure similar things. The aim is to combine these variables into one variable called Extremism. With an Alpha of 0.94, we can be pretty confident that these variables can be combined into one or more new variables.

7.2.2 Sum and Mean Scores

Sum scores and mean scores are two methods for creating composite measures from multiple items. Each method has its own applications and implications.

Sum Scores

A sum score is the total score obtained by adding up the responses to individual items. For example, if your survey has five questions about political engagement, each rated from 1 (low engagement) to 5 (high engagement), a respondent’s sum score would be the total of their responses to these five questions. We use sum scores when each item is considered equally important and contributes equally to the construct being measured. Sum scores provide a new variable that is normally quite easy to interpret: higher sum scores directly indicate higher levels of the measured construct. If you're measuring political participation, you might ask respondents how often they vote, attend rallies, volunteer for campaigns, and discuss politics. You would use a sum score if each of these activities is seen as contributing equally to their overall political participation.

Mean Scores

A mean score is the average score obtained by adding up the responses and then dividing by the number of items. Continuing with the previous example, you would calculate the mean score by adding the responses to the five questions and then dividing by five. The mean score makes comparisons easier, as it is not influenced by the number of items. Mean scores also cope better with missing data, provided there are enough remaining items to calculate a meaningful average. For example, if you measure trust in three institutions (e.g., parliament, the judiciary, and the police), you might have a different number of questions for each institution. A mean score for each institution allows for more direct comparison of trust levels, regardless of the number of questions related to each.
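As a quick illustration, here is how you might compute both scores with pandas; the item names and responses below are made up:

```python
import pandas as pd

# Hypothetical responses to three engagement items, each rated 1-5.
df = pd.DataFrame({
    "engage_1": [4, 2, 5],
    "engage_2": [3, 1, 4],
    "engage_3": [5, 2, None],   # the third respondent skipped this item
})
items = ["engage_1", "engage_2", "engage_3"]

# Sum score: skipna=False leaves the score missing if any item is missing,
# so respondents with incomplete answers are not silently under-scored.
df["engage_sum"] = df[items].sum(axis=1, skipna=False)

# Mean score: pandas skips missing values by default, which is why mean
# scores cope better with missing data than sum scores.
df["engage_mean"] = df[items].mean(axis=1)
```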

Sum Score vs. Mean Score

The choice between sum scores and mean scores often comes down to the scale construction and research question. Sum scores are straightforward but can be misleading if the number of items varies across scales or if some items are more important than others. Mean scores are more flexible and allow for comparability but can dilute the impact of particularly important items unless weighted means are used.


Standardising Scores

Turning mean and sum scores into standardized scores, such as z-scores, is a common statistical practice and is particularly useful for comparing scores from different groups or where these are measured on different scales. Let's discuss why and how this is done.

Standardized Scores

Remember z-scores? Standardized scores convert individual scores into a common scale with an understood mean and standard deviation. The most common standardized score is the z-score, which represents the number of standard deviations a given score is from the mean of the dataset. Z-scores allow for comparison across different tests or surveys. For instance, if one political survey uses a 5-point scale and another uses a 10-point scale, standardizing these allows for direct comparison between them. Z-scores also help when the original units of measurement are not directly comparable or when they have different variances (e.g. when you are comparing income across different countries). Z-scores also allow you to easily identify outliers, as any z-score above 3 or below -3 is typically considered unusual. Finally, many statistical analyses require data to be on a standardized scale. Z-scores are used in regression analysis, principal component analysis, and other multivariate methods.

Your statistical programmes should be able to standardise the sum or mean score easily. For reference, this is how they are calculated:

For sum scores: z = (sum score − mean of all sum scores) / standard deviation of the sum scores

For mean scores: z = (mean score − mean of all mean scores) / standard deviation of the mean scores

The resulting z-scores have a mean of 0 and a standard deviation of 1. This means that if a score has a z-score of 0, it's exactly average. A positive z-score indicates a value above the mean, while a negative z-score indicates a value below the mean.
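A small sketch of the calculation in Python; the scores are made up:

```python
import pandas as pd

# Hypothetical sum scores for five respondents.
sum_scores = pd.Series([12, 8, 15, 10, 9])

# z = (score - mean of all scores) / standard deviation of the scores
z_scores = (sum_scores - sum_scores.mean()) / sum_scores.std(ddof=1)

print(round(z_scores.mean(), 2))       # 0.0: standardized scores centre on zero
print(round(z_scores.std(ddof=1), 2))  # 1.0: with a standard deviation of one
```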

Standardizing scores is a fundamental technique in data analysis, providing a way to compare and interpret scores from different scales or groups on a level playing field. It’s particularly useful in fields like social science, where scales and constructs can vary widely between studies and populations.

Before you combine data using sum or mean scores, you should run a Reliability Analysis to check that your items reliably measure the same concept; remember that the items must all be measured on the same scale.

7.3 PCA & EFA

7.3.1 Principal Component Analysis or Exploratory Factor Analysis?

Exploratory Factor Analysis (EFA) and Principal Component Analysis (PCA) are both techniques used for data reduction, but they have different goals and methods. Let's break them down into simpler terms:

Principal Component Analysis (PCA):

The goal of the PCA is to reduce the dimensionality of a dataset while retaining as much of the variance (information) as possible. It's like summarizing a book into key points so that you can still understand the main story without reading every page. PCA finds new variables that are linear combinations of the original variables. These new variables are chosen in a way that the first principal component has the highest possible variance (captures most information), and each succeeding component has the highest variance possible under the constraint that it is orthogonal (at a right angle) to the preceding components. You use a PCA when you want to simplify data analysis by reducing the number of variables but still keeping as much information about the original dataset as possible.
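A minimal sketch of a PCA using scikit-learn; the data here is randomly generated, purely to show the mechanics:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 respondents answering 8 survey items.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))

# Standardize first so the PCA works on the correlation structure
# rather than being dominated by items with larger variances.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)              # keep the first two components
components = pca.fit_transform(X_std)  # one row per respondent

# Proportion of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```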

Exploratory Factor Analysis (EFA):

EFA on the other hand, seeks to uncover the underlying structure of a dataset. It tries to identify the underlying factors (latent variables) that explain the patterns of correlations among the observed variables. Think of it as discovering hidden themes or topics in a collection of essays. An EFA groups variables that are correlated with each other but not with other groups of variables. These groups are your factors. They represent underlying dimensions that can explain the observed correlations. The factors might represent concepts or constructs that are not directly measured by any single variable in your data. You would use an EFA when you want to identify latent constructs that might explain observed patterns and when you're not sure what these constructs are before starting the analysis.

EFA is generally driven by a theory that you want to test - familiarity with the literature and an understanding of what your data is measuring are, therefore, key. 

Key Differences:

A PCA focuses on preserving total dataset variance and simplifying data structure. EFA aims to identify and interpret underlying factors or constructs within the data. PCA produces principal components that are linear combinations aiming to capture maximum variance. EFA identifies factors that represent underlying themes or constructs causing the observed correlations.

Note: PCA components are not necessarily interpretable, as they are mathematical constructs. EFA factors are intended to be interpretable and have theoretical meaning.

In summary, PCA is like summarizing a book to capture the essence without going into details, while EFA is like analyzing the book to uncover underlying themes that explain the storyline.

For a more advanced introduction to PCA, watch either this video (5 min) or this one (20 min).

Note on PCA v EFA

Running a PCA and EFA can feel very similar. Many of the steps are the same. Below we will go over each of the steps you need to take before saving your new factors/components into your dataset.

Steps to consider before running a PCA or EFA

Conduct a Reliability Analysis to evaluate the internal consistency of your scales. Tools like Cronbach's alpha are commonly used, with a value of 0.7 or above generally considered acceptable for early-stage research, though this threshold can vary depending on the context. If you do not get the values you need, review and revise your scale and/or remove variables from your analysis. This step might involve further analysis, such as item-total correlations, to decide which items to keep or discard (something not covered in these tutorials).

Once you are ready to proceed, decide whether you want to run a PCA or EFA. 

7.3.2 Assumptions Checks


Assumption Checks: 

Surprise surprise, you will need to check that certain assumptions are met. To demonstrate how a PCA and EFA work, we will use the extremism dataset and create a new variable that measures extremism. In the end, we opted for an EFA because extremism is a latent variable measured through a series of other measures. However, a PCA also adequately describes the structure within the variables.

Bartlett's test of Sphericity

First, we run Bartlett's test of Sphericity. This tests the hypothesis that your correlation matrix is an identity matrix (don't worry too much about this term), which would indicate that your variables are unrelated and therefore unsuitable for data reduction.

Check your p-value to ensure a PCA or EFA is an appropriate test: a significant result (p < .05) means you can reject the idea that your variables are unrelated. If the assumptions are met, you can move on. If they are not, you need to consider other options.

KMO Measure of Sampling Adequacy Test

Also, run a Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) test. The MSA score shows the proportion of variance in the variables that might be caused by underlying factors. It runs from 0 to 1, with values closer to 1 indicating high shared variance and values near 0 indicating little or none; values above 0.6 are generally considered acceptable.

The outcome of the assumption checks will be the same for both tests. 
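If you are working in Python, one way to run both checks is with the factor_analyzer package. This is a sketch only, and the file name is hypothetical:

```python
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

# Hypothetical file holding only the items you want to reduce.
df = pd.read_csv("extremism.csv")

chi_square, p_value = calculate_bartlett_sphericity(df)
print(f"Bartlett's test: chi2 = {chi_square:.1f}, p = {p_value:.4f}")  # want p < .05

kmo_per_item, kmo_overall = calculate_kmo(df)
print(f"Overall KMO (MSA): {kmo_overall:.2f}")  # want above 0.6
```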

Example:

As mentioned, we are using the Extremism dataset and are creating a new variable that measures an individual's Extremism Score. The tables below show us that both assumptions are met: all variables have a KMO above 0.6, and Bartlett's test returns a p-value below 0.05.

7.3.3 Extraction & Rotation

Extraction of Components/Factors

The extraction process in both PCA and EFA aims to reduce the dimensionality of data, but they approach it with different objectives and theoretical underpinnings.

PCA Extraction Process:

The extraction method is inherently defined by the technique itself. PCA identifies the principal components that maximize the variance in the dataset. It automatically computes eigenvalues and eigenvectors from the covariance or correlation matrix of the data. The principal components are extracted based on these eigenvalues and eigenvectors (more on these further down), without the need for the user to select an extraction method. The goal is to transform the original variables into a new set of orthogonal (uncorrelated) variables (principal components) that summarize most of the information (variance) in the original dataset.

EFA Extraction Process:

You must select an extraction method because the goal is to uncover the underlying structure or factors that explain the correlations between variables. EFA is more exploratory in nature and aimed at identifying latent constructs. Common extraction methods in EFA include Maximum Likelihood, Principal Axis Factoring, and Minimum Residuals. The choice of method can affect the factors extracted and their interpretation.

To summarise, the extraction method depends on your dataset's characteristics, the specific goals of your analysis, and the assumptions you are prepared to make. Consider the type of variance you are interested in (common vs. total), the size and distribution of your dataset, and the level of statistical rigor required for your analysis.

Extraction PCA v EFA

PCA does not require the user to select an extraction method, as it is a fundamental part of the technique. EFA requires the researcher to choose an extraction method based on the nature of the data and the research questions. While both PCA and EFA reduce data dimensionality, PCA does this by identifying principal components based on variance, with a predefined mathematical procedure. In contrast, EFA seeks to uncover latent factors that explain correlations among variables, requiring the researcher to select an appropriate extraction method to best reveal these underlying structures.

Rotation:

Both PCA and EFA rotate factors to help interpret the output. There are two types of rotation: orthogonal, in which factors are not permitted to correlate with each other, and oblique, in which factors are free to take any position, including being correlated with each other. After rotation, the factors are re-arranged to pass optimally through clusters of shared variance. This makes the factors more interpretable.

The choice between orthogonal and oblique rotations should be guided by theoretical considerations and the nature of the data:

- Choose an orthogonal rotation (e.g., varimax) when you expect the factors to be independent of one another.
- Choose an oblique rotation (e.g., oblimin or promax) when you expect the factors to be correlated.

For instance, in our extremism dataset, we believe that the answers to each of the different variables on violence are likely to be correlated - this is often the case in the social sciences. 

Most statistical programs allow you to easily select the extraction and rotation methods. 
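In Python's factor_analyzer package, for example, the extraction method and rotation are simply arguments. This is a sketch: the file name is hypothetical, and the number of factors is something you choose in the next step:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("extremism.csv")  # hypothetical file of items

# Maximum Likelihood extraction with an oblique (oblimin) rotation,
# because we expect the underlying factors to be correlated.
fa = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")
fa.fit(df)

# For an orthogonal solution, swap rotation="oblimin" for "varimax";
# for a different extraction, swap method="ml" for "minres" or "principal".
```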

7.3.4 Select and interpret your Components/Factors

Finally, you have to select how many factors/components you want. You have three options:

- use parallel analysis
- base the number on eigenvalues (the Kaiser criterion)
- fix the number of factors yourself

When to use Parallel Analysis:

Parallel analysis is generally more accurate and less arbitrary than relying on eigenvalues alone. You can also use this method if your data is complex and you want to avoid overfactoring (identifying too many factors) or underfactoring (missing important factors).

Parallel analysis helps you decide how many factors are actually contributing meaningful information above what could be expected by chance.
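Not every program offers parallel analysis as a menu option, but the idea is simple enough to sketch by hand. Here is one hypothetical implementation in Python: it compares the eigenvalues of your correlation matrix against the average eigenvalues of random data of the same shape:

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_iter: int = 100, seed: int = 0) -> int:
    """Number of factors whose eigenvalues beat those expected by chance."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape

    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    # Average eigenvalues from random data of the same shape.
    random_eigs = np.zeros((n_iter, n_vars))
    for i in range(n_iter):
        noise = rng.normal(size=(n_obs, n_vars))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]

    # Keep factors whose observed eigenvalue exceeds the random average.
    return int(np.sum(observed > random_eigs.mean(axis=0)))
```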

When to Base on Eigenvalues:

What are Eigenvalues and scree plots? 

In factor analysis, an eigenvalue represents how much of the variance (the amount of information or data spread) in all the variables a particular factor captures. A higher eigenvalue means that the factor is more significant or important in explaining the variation in your data. By convention, a factor with an eigenvalue of 1 or more is considered significant because it explains more variance than a single variable by itself.

We can visualise how many factors have eigenvalues above 1 by using a Scree Plot. It's a simple line plot that shows the eigenvalues of all the factors from your analysis. On the x-axis, you have the factors, and on the y-axis, the eigenvalues. The plot starts with the factor with the highest eigenvalue and goes down to the smallest. You look for the point where the steep decline in eigenvalues levels off; this is often referred to as the 'elbow'. The factors before this point are the 'mountains' (important factors), and after it are the 'hills' (less important factors). This 'elbow' helps you decide how many factors to keep for further analysis.
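If your program does not draw one for you, a scree plot is easy to sketch in Python; the file name below is hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("extremism.csv")  # hypothetical file of items

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(df.values, rowvar=False))[::-1]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1, linestyle="--")  # Kaiser criterion: keep eigenvalues above 1
plt.xlabel("Factor")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```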

Using Eigenvalues

In essence, eigenvalues help identify the value of each factor, and scree plots visually aid in determining the number of meaningful factors in your factor analysis.

Eigenvalues measure the variance in the total set of variables that a factor explains. The traditional "Kaiser criterion" suggests keeping factors with eigenvalues greater than 1. You would use this method when you need something quick and straightforward, your dataset is relatively simple, and you are confident that factors with eigenvalues over 1 will capture the underlying constructs. Be aware that relying solely on eigenvalues can sometimes lead to retaining too many factors (since any factor with an eigenvalue over 1 explains more variance than a single variable).

The below plot shows you the eigenvalues for our EFA to create an Extremism score. The dotted line represents the eigenvalues before rotation. The scree plot indicates that only one factor has an eigenvalue above 1 - Factor 1 (with an eigenvalue of 6.4).

Using Fixed Numbers: 

Use fixed numbers if you have a theoretical justification. If you have a solid theoretical basis to expect a specific structure, you might choose to extract a fixed number of factors that align with your theoretical expectations. Sometimes this may also depend on your research goals. If you are exploring a model that theorizes a certain number of constructs, you might fix the number of factors to match this model. In some cases, practical considerations or the need for a simpler model might lead you to choose a fixed number of factors, especially if you need to communicate your findings to a non-technical audience or apply the model in practice where simplicity is valued.

If you decide to use fixed numbers, you can always check how your outcome differs when you use eigenvalues or parallel analysis.

Interpreting the EFA/PCA tables:

Each row represents one of the variables you have included in the test. Each column represents a new factor. The final column gives you a uniqueness score.

Factor Loading:

The numbers in the table are called Factor Loadings. A factor loading is a coefficient that indicates the degree of correlation between an observed variable (e.g., a survey item) and a latent factor, representing an underlying dimension or construct within the data. These loadings, which can range from -1 to 1, signify how strongly each variable is associated with each factor, with values close to 1 or -1 indicating a strong association, and values near 0 suggesting little or no relationship. Essentially, factor loadings help to identify which variables are good indicators of the underlying factors by showing the extent to which variables share a common underlying pattern or theme. This understanding allows researchers to interpret the factors meaningfully, grouping variables based on their loadings to define the latent constructs within their dataset.

Uniqueness Score

In Exploratory Factor Analysis (EFA), the uniqueness (sometimes referred to as "uniquenesses" or "unique variances") of an item is a measure of how much of the variance in that item is not explained by the factors extracted in the analysis. It tells us about the portion of an item’s total variance that is unique to the item itself, not shared with other items through common factors. Here's how to interpret the uniqueness score:

- A low uniqueness (close to 0) means that most of the item's variance is explained by the extracted factors: the item fits well within the factor structure.
- A high uniqueness (close to 1) means that little of the item's variance is shared with the other items through the factors: the item may measure something distinct, or it may simply be noisy.
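In Python's factor_analyzer, the loadings and uniquenesses are available once the model is fitted. This sketch continues the earlier hypothetical example:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("extremism.csv")  # hypothetical file of items
fa = FactorAnalyzer(n_factors=1, method="ml", rotation="oblimin")
fa.fit(df)

# One row per variable, one column per factor; the values are loadings.
table = pd.DataFrame(fa.loadings_, index=df.columns)

# Uniqueness = 1 - communality: variance NOT explained by the factors.
table["uniqueness"] = fa.get_uniquenesses()
print(table.round(2))
```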

Practical Consideration:

While uniqueness is a valuable piece of information, its interpretation should also consider the theoretical framework of your study and the content of the items. Sometimes, an item with relatively high uniqueness might still be crucial for theoretical reasons or to cover the full breadth of a construct. Therefore, decisions about item inclusion or exclusion should not be based solely on statistical criteria like uniqueness but also on theoretical justification and empirical evidence of item relevance and necessity.

7.3.5 Model Fit

Checking the model fit of an EFA or PCA is an essential step in the analysis process. EFA aims to uncover the underlying structure of a set of variables by identifying latent factors that can explain the observed correlations among these variables. Assessing the fit of the EFA model ensures that the factors derived from the analysis accurately represent the data structure.

Checking model fit can help identify problems with the current model, such as too few or too many factors, which can lead to revisions and improvements in the model. Demonstrating a good fit lends credibility to the research by showing that the analysis adheres to statistical best practices and that the findings are robust.

Checking your Model Fit: 

When you are done, you should also check the model fit, in this case the RMSEA (Root Mean Square Error of Approximation), TLI, Chi-Square, and CFI. The RMSEA evaluates how well a model approximates the real data. Unlike other fit indices that merely compare the predicted model to a null model (a model of complete independence), the RMSEA takes into account the complexity of the model (the number of parameters being estimated) to avoid rewarding overly complex models.

Interpreting the RMSEA score:

RMSEA values range from 0 to 1, where a value of 0 indicates a perfect fit, and higher values indicate a worse fit. However, values are often interpreted in a more nuanced way. Commonly cited guidelines are:

- 0.05 or below: close fit
- 0.05 to 0.08: reasonable fit
- 0.08 to 0.10: mediocre fit
- above 0.10: poor fit

RMSEA is often reported with a 90% confidence interval, providing a range of plausible values for the true RMSEA in the population. Narrow confidence intervals around low RMSEA values strengthen confidence in the model's fit.

In summary, the RMSEA is a robust measure for evaluating the fit of a statistical model, balancing between the goodness of fit and the model complexity. It helps researchers to assess whether the model adequately represents the data without overfitting by including too many parameters.
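For reference, here is one common formula for the RMSEA, sketched in Python; the chi-square, degrees of freedom, and sample size below are made up:

```python
import numpy as np

def rmsea(chi_square: float, df: int, n_obs: int) -> float:
    """RMSEA from the model chi-square, its degrees of freedom, and sample size."""
    return float(np.sqrt(max(chi_square - df, 0) / (df * (n_obs - 1))))

# Hypothetical example: chi2 = 150 on 20 degrees of freedom, 300 respondents.
print(round(rmsea(150, 20, 300), 3))  # 0.147 -> a poor fit by the usual cut-offs
```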

The TLI (Tucker-Lewis Index) score

The TLI compares the fit of the specified model to a null model, with values closer to 1 indicating a better fit. The commonly accepted threshold of 0.95 and above suggests a good fit.

The Chi-square test:

This test assesses the discrepancy between the observed and expected covariance matrices. A significant p-value indicates that the model does not fit the data well. The large χ² value further supports the conclusion of a poor fit, as it suggests a substantial difference between the model and the observed data.

Factor Correlation and Summary: 

Checking the model fit in EFA is not just a technical step; it's a fundamental part of the analysis that impacts the validity and reliability of the research findings. A well-fitting model provides a solid foundation for interpreting the factors, applying the results in further research, and making substantive conclusions based on the data. Therefore, investing time in assessing and reporting the model fit in EFA is essential for conducting rigorous and credible research.

While the assumptions are met and all factors fall within the acceptable range, the model fit suggests that this combination is not necessarily the best fit for a new variable, given an RMSEA of 0.13.


7.4 Additional Learning Materials