Tutorial 1:

Working with Quantitative Data

Overview & Learning Outcomes

In this tutorial you will learn about: 

Getting to grips with these concepts is fundamental. Getting them wrong can lead to an analysis that is meaningless. This page goes over these fundamental concepts.

In the sub-pages for this tutorial, you will learn how to apply these concepts to your own datasets. 

Suggested Reading: 

Further Reading: Quantitative Research Design

This tutorial focuses on understanding data that has already been collected. If you conduct your own research, it is really important to understand these concepts and to think about them in your research design phase. The readings below provide you with a good overview of Quantitative Research Design.

While research design is extremely important, it is not covered here, as the focus is on learning basic data analysis skills.  

Concept: Level of Measurement & Variables 

All datasets are made up of a series of variables. Each variable holds a specific type of data (continuous, ordinal or nominal). In this section we will introduce you to the types of variables and the data that they can hold.

Level of Measurement: Continuous/Ordinal/Nominal

Each variable contains data. It is this data that can be classified into the following three categories, sometimes also called the levels of measurement:

Knowing what type of data your variables hold is extremely important and cannot be overemphasised. Statistical tests require very specific types of data. Although you may still get results, using the wrong type of data will lead to wrong and meaningless results.

Understanding the distinction between these types of data is extremely important as it will determine what tests, plots and graphs you can and cannot use in your analysis. 

Continuous Data 

Continuous data is expressed in numbers. Essentially, the numbers measure something on a numerical scale, and the distance between adjacent points on that scale is always the same.

This type of data can be further broken down into two categories of data:

For example, when you are talking about people it generally does not make sense to talk of 9.5 people (discrete data), but it does make sense to talk about a height of 1.85m (continuous data).

Discrete data is counted - Continuous data is measured.

Continuous data is further subdivided into interval and ratio data.

For the purpose of these tutorials, all you need to remember is that continuous data measures something on a continuous scale and that the distance between each data point is known.

Ordinal Data

Ordinal Data classifies variables into categories that have a natural order or rank. Unlike nominal data, ordinal data can be placed into a hierarchical order, e.g. from high to low. That said, the distance between each group is not necessarily known. 

Examples of ordinal data include levels of education, socio-economic status, satisfaction rankings, etc. Many Likert Scale type questions fall into the ordinal data category.

Ordinal data can contain numbers; however, these numbers represent categorical labels. Just because you see numbers does not mean it is a continuous variable. Always check your variables closely to see what they actually measure.

As you can see from the image below, there are three categories. As these relate to age, they can be placed in a logical and sequential order, but the difference between each category is unknown and/or unequal.

Note

As the numbers represent categorical labels, they have no inherent mathematical value. This means you should not calculate measures such as the mean or median; these measures become meaningless for ordinal data.

To represent your data use the following instead:

Usually, you would display the data as percentages rather than as counts, but you can consider adding both into your tables/graphs. 

Nominal Data

Nominal data, sometimes also called categorical data, is data made up of labels that cannot be placed in any logical or sequential order. The data is not measured or evaluated; it is simply assigned to multiple groups.

Examples include ethnicity, gender, birthplace etc. You can see from the image below, that there is no inherent order for the variable Birthplace. 

Note

Often nominal data is displayed using numbers, e.g. 0 = 'Male'; 1 = 'Female'. These numbers are placeholders for categorical labels, and this does not make the variable continuous.

To represent your data use the following instead:

Usually, you would display the data as percentages rather than as counts, but you can consider adding both into your tables/graphs. 

Using any other descriptive statistics (more on that in the next tutorial) will lead to numbers that look correct, but that are actually meaningless.  
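
If you explore your data in Python rather than Jamovi, the counts and percentages described above are straightforward to produce with pandas. This is only a sketch; the variable name 'Birthplace' and its values are invented for illustration.

```python
import pandas as pd

# A small made-up nominal variable for illustration
df = pd.DataFrame({"Birthplace": ["UK", "France", "UK", "Nigeria", "France", "UK"]})

# Counts per category
counts = df["Birthplace"].value_counts()

# Percentages per category (normalize=True returns proportions)
percentages = df["Birthplace"].value_counts(normalize=True) * 100

# Combine counts and percentages into one table for reporting
summary = pd.DataFrame({"Count": counts, "Percent": percentages.round(1)})
print(summary)
```

The same approach works for ordinal variables; a frequency table respects the categories without treating them as real quantities.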

Variables: What are they?

A variable is a record of any number, quantity or characteristic that can be measured. Examples are Age, Gender or Attitudes, to name but a few. So the variable Age, for example, would be a record of the age of every person you spoke to as part of your research. There is no limit to the number of observations a variable can hold.

Variables can be manipulated, controlled for or measured in your research. Research experiments will consist of a series of different variables. The following are the most common types of variables: Independent, dependent, control, latent and composite variables. 

Independent & Dependent Variables

Think of independent and dependent variables in terms of cause and effect. Your independent variable is the one you think is causing the change in the dependent variable - the effect.

In the image below, the amount of water/amount of fertiliser is your independent variable while the height of the plant is your dependent variable. 

Your implicit assumption is that more water increases the height of the plant.

In short, A (independent variable) > changes B (dependent variable)

Sometimes Independent variables are also called: 

Dependent Variables are also sometimes called:

For a more detailed discussion, you can read more here

Composite & Latent Variables

Composite Variables

These are variables that are made by combining multiple variables into one. You do this at the analysis stage. How to do this will be covered later.  In the example below, a series of questions about attitudes are combined into a new variable that measures Extremism. 
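
As a rough illustration of how a composite variable might be built at the analysis stage in Python (one of the tools mentioned later for working with data), the sketch below averages several attitude items into a single score. The column names att_1 to att_3, the values and the use of a mean are all assumptions made for the example; how you combine items depends on your design.

```python
import pandas as pd

# Invented attitude items, each scored 1-5
df = pd.DataFrame({
    "att_1": [2, 4, 5, 1],
    "att_2": [3, 4, 5, 2],
    "att_3": [2, 5, 4, 1],
})

# Composite variable: the mean of the three items for each respondent
df["extremism_score"] = df[["att_1", "att_2", "att_3"]].mean(axis=1)
print(df)
```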

Latent Variables:

These are variables that serve as a proxy for something that you cannot directly measure. In the example above, not only is this a composite variable,  but it is also a latent variable as you cannot measure extremism directly.

What is the difference between the two?

Latent variables give rise to measurable manifestations of an unobservable concept, while composite variables arise from the total combined influence of measured variables.

Grouping Variables

A grouping variable splits your observations into one group for each unique value it contains. In the example below, Gender is the grouping variable, which splits the data into 'Male' and 'Female'.

Grouping variables can be used to measure the difference between these groups, e.g. what is the difference between 'Males' and 'Females'?

Note, grouping variables can often also be your independent variable.
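
In Python, a grouping variable maps directly onto a group-by operation. The sketch below uses an invented Gender/Height dataset to show how observations are split into one group per unique value and summarised per group.

```python
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Height": [1.80, 1.65, 1.70, 1.75, 1.60],
})

# One group per unique value of the grouping variable, then a summary per group
group_means = df.groupby("Gender")["Height"].mean()
print(group_means)
```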

Confounding, Covariate & Control variables

A confounding variable is associated with both the independent and dependent variables but is not the subject of your study. Confounding variables may distort or mask the effects of another variable on the outcome.

Confounding variables affect the variables being studied meaning the results may not reflect the actual relationship between the independent and dependent variables. The best way to control for confounding variables is a good research design and in-depth knowledge and understanding of your research area.  

Covariates: 

A covariate can be an independent variable that is of direct interest to your study, or an unwanted confounding variable. Adding covariates can increase the accuracy of your models. In studies on extremism, for example, household income could be an additional covariate alongside age. Household income could also be considered a confounding variable.

Control Variables: 

When analysing your data, you may want to control for, or keep constant, a specific variable. As shown on the left, your independent variable changes, and so does the outcome, but one or more control variables are held constant (e.g. water levels stay the same despite a change to the independent variable).

Concept: Getting your Data Ready

Now that you have a good understanding of what types of variables and data there are, it is important to explore how most datasets are structured - in rows and columns.

Data analysts spend around 80 to 90 per cent of their time cleaning their data. In this course, the data provided will already be cleaned. However, if you conduct your own research or download data from data repositories, data cleaning is often required. Below, you will be introduced to basic dataset structures and what to do to clean your data. Cleaning data can be done in Excel (very time consuming), Python or R.

Below we will cover the following:

This should provide you with a basic introduction to this topic. 

Dataset Structure 

If you have used Excel before you should be pretty familiar with basic dataset structures. Although each dataset will look different, the structure remains the same. Have a look at the image below. Most datasets will be organised into Rows and Columns. 

Each row represents one observation. Each observation can, however, have multiple fields. For example, one row might represent a participant's answers to all the survey questions, or it might record all the information about an incident. One row is made up of a number of variables (columns).

Each column represents one variable. A variable is any characteristic, number, or quantity that can be counted or measured. A variable will be something like an individual's age, sex, income, ethnicity or view on a topic.

Understanding the Rows/Columns structure is really important. This structure will remain the same regardless of what software you use, whether it is Excel, Jamovi, SPSS, R, Python, etc. 

It is worth noting that there are other types of dataset structures, e.g. JSON. You will not be introduced to these here, but even they are often converted into a row/column format like the one shown above to make analysis easier.
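
If you work in Python, the same row/column structure is what you get when you load a file into a pandas DataFrame. A minimal sketch, assuming a placeholder file called survey.csv:

```python
import pandas as pd

# Load a dataset; 'survey.csv' is a placeholder file name
df = pd.read_csv("survey.csv")

# Rows are observations, columns are variables
print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # the variable names
print(df.head())   # the first few observations
```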


Data Cleaning

As mentioned above, cleaning your data can take up most of your time. Below are some of the main things you should consider when cleaning a dataset.

Delete unwanted Variables (columns)

Raw datasets will often have variables that are simply of no use. When you conduct a survey on a platform like Qualtrics or SurveyMonkey, the raw data contains additional variables such as progress, the person's IP address, etc. There may also be questions about ethics that will add nothing to your dataset and can be deleted.
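
As a sketch of this step in Python/pandas - the file name and column names are invented examples of the kind of metadata survey platforms add:

```python
import pandas as pd

df = pd.read_csv("survey_raw.csv")  # placeholder file name

# Drop metadata and ethics columns that add nothing to the analysis
unwanted = ["Progress", "IPAddress", "EthicsConsent"]
df = df.drop(columns=unwanted, errors="ignore")  # ignore any that are not present
```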

Rename your variables (Column Header):

Often you will need to rename your variables. The raw dataset will frequently use the full question you asked as the variable name, which is not suitable. Variable names should be short (but still make sense).

E.g.: How old are you? > Age

Depending on the size of your dataset, this may take quite some time. You can add the original questions to your codebook.
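
In pandas, renaming is a single call. The question text and short names below are hypothetical:

```python
import pandas as pd

# A tiny invented raw dataset with question text as column headers
raw = pd.DataFrame({
    "How old are you?": [25, 41],
    "What is your highest level of education?": ["BA", "MSc"],
})

# Map the long question text to short, meaningful variable names
clean = raw.rename(columns={
    "How old are you?": "Age",
    "What is your highest level of education?": "Education",
})
print(clean.columns)
```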

Delete unwanted Headers (Top Rows):

Often your raw dataset will include a number of rows at the top that are superfluous. Delete them - often this is just useless data. If any of it is needed, add it to the codebook.

You should now be left with one Header row (this should have your variable names) and lots of rows with raw data in them.
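
If you load the raw file in Python, you can skip the superfluous rows at import time rather than deleting them by hand. The file name and the number of rows to skip are placeholders; check your own export first.

```python
import pandas as pd

# If the extra rows sit below the header row (as in many survey exports),
# keep row 0 as the header and skip the next two rows.
df = pd.read_csv("survey_raw.csv", skiprows=[1, 2])

# If the junk rows sit above the header instead, skip them from the top:
# df = pd.read_csv("survey_raw.csv", skiprows=2)
```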

Delete incomplete observations:

It is highly likely that you will have some incomplete observations, and you may want to delete some of them. You will need to exercise some judgment when deciding which incomplete observations to delete.

In a survey, if a participant has only completed 50-60%, or maybe even 70%, of the survey, consider deleting that observation - but again it depends on what data is missing. If the questions they have left unanswered are inconsequential, then consider keeping the observation.

If you download your raw data from Qualtrics, JISC survey or SurveyMonkey, it will usually include some sort of progress variable that tells you how much of the survey has been completed. You can use the filter function in Excel to separate the observations you want to keep from those you want to delete, and then delete the latter.
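
The same filtering can be scripted in pandas. 'Progress' is a hypothetical completion-percentage column of the kind survey platforms export, and the 70% cut-off is just an example:

```python
import pandas as pd

df = pd.read_csv("survey_raw.csv")  # placeholder file name

# Keep only observations where at least 70% of the survey was completed
df = df[df["Progress"] >= 70]
```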

Working with missing data:

It is inevitable that you will have missing data, even after you have deleted incomplete observations. There are two ways of handling missing data:

For these tutorials we are going to use step one - though we actually won't delete all of the observations with missing data. You can find more information about imputation here

When you clean your data you should do the following:

When analysing your data, you MUST tell the stats programme how to identify missing data. Otherwise, you may end up with some bizarre results. Qualtrics, for example, codes missing data as '-99'. If you forget to tell your stats programme that all values of '-99' represent missing data, the computer will use them in its calculations.

Where you have missing data, make sure you use an approach that is appropriate and consistent for the stats programme you are using. Jamovi, for example, recognises blank cells or 'na' as missing data - no further action is needed. If you use '-99', on the other hand, you have to manually tell Jamovi that this value represents missing data.

We don't need to delete all of the missing data, as Jamovi removes missing data from the analysis. If there is too much missing data, you will get an error telling you that the analysis is not possible.
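
If you clean your data in Python, you can declare the missing-data code when you load the file or convert it afterwards. The '-99' code follows the Qualtrics example above; the file and column names are placeholders.

```python
import pandas as pd
import numpy as np

# Option 1: treat -99 as missing while reading the file
df = pd.read_csv("survey_raw.csv", na_values=[-99])

# Option 2: convert an existing column's -99 codes to missing values afterwards
df["Age"] = df["Age"].replace(-99, np.nan)

# Check how much missing data each variable has
print(df.isna().sum())
```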

Correct your levels: 

Make sure that all the values are correct. If you have ordinal variables, change the level labels to numbers. You can later reattach the labels in Jamovi. For example:

If you do not change your levels, some ordinal tests will not work. Even if they do work, you must tell the stats programme the order the levels belong in.

For nominal data you should consider switching the labels to numbers too (personal preference); however, this is not necessary. If you don't change them over, make sure your text is correct, e.g. 'Other' rather than 'Other:'. You can also convert the labels into numbers, e.g.:

Remember this does not turn the variable into a continuous variable.  
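
A pandas sketch of recoding levels; the variable names, labels and number codes are made-up examples of the ordinal and nominal recoding described above:

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["School", "Undergraduate", "Postgraduate", "Undergraduate"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Ordinal: replace level labels with numbers that reflect their order
df["Education"] = df["Education"].map({"School": 1, "Undergraduate": 2, "Postgraduate": 3})

# Nominal: the numbers are only placeholders for labels, not real quantities
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
print(df)
```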

Once you have gone through all of the above steps your dataset should be pretty clean. In Jamovi, you should also make sure that the correct level of measurement is selected. 

Computers like using numbers. Ideally, your ordinal variables will have numbers with relevant labels attached. Each number represents the order of the group. 

If your variable has text instead of numbers, the computer will not recognise the variable as ordinal. You will either have to manually order the levels or change them over to numbers. More on that in the section on data cleaning.
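
If you stay with text labels in Python, you can tell the software the order explicitly by making the variable an ordered categorical. The age-group labels below are invented:

```python
import pandas as pd

age_group = pd.Series(["18-30", "51+", "31-50", "18-30"])

# Declare the order so the variable is treated as ordinal rather than plain text
age_group = pd.Categorical(age_group, categories=["18-30", "31-50", "51+"], ordered=True)
print(age_group)
```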

Transforming your Data

Often you may need to transform one variable into another. Let's say you have collected the age of individuals, but you actually want to know if there is a difference between those aged 50 and under and those over the age of 50.

Many transformations simply involve turning an existing variable into a new variable with new groups and labels.

Most statistics programmes make transforming one variable into another relatively easy. In the image below you see a simple transformation of a continuous variable into an ordinal variable.

One of the sub-pages for this tutorial takes you through transforming variables in Jamovi.
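
As a sketch of the same kind of transformation in pandas - the ages and the 50-year cut-off follow the example above, while the column names and bin edges are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Age": [23, 45, 50, 51, 67]})

# Transform the continuous Age variable into an ordinal variable with two groups
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 50, 120],  # intervals: (0, 50] and (50, 120]
    labels=["50 and under", "Over 50"],
)
print(df)
```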

As pointed out, simple transformations can easily be done. However, you may often want to start combining variables to reduce them into a latent or composite variable.

Again, this can easily be done, but before we do so we should run either a Reliability Analysis, a Principal Component Analysis, or an Exploratory Factor Analysis. These tests will show you whether the variables you want to combine are actually statistically correlated or not. This topic will be covered in more depth in Tutorial 5.