Tutorial 2: Preparing your Data

Tutorial Overview:

With your newfound knowledge of variable types and data structures, it's crucial to understand that datasets typically arrange data in rows and columns. The process of preparing your data for analysis can take a significant amount of time but is essential. 

You can of course find data that is ready to be analysed, but usually, there is always something that needs to be done to get the data ready. Understanding how datasets work and how to prepare them is a vital part of data analysis. We'll explore dataset structures, tackle data cleaning (including dealing with missing values), and learn about variable transformation. 

Python and R make cleaning data much easier, but it does require coding skills. Alternatively, you can use Excel, although this will be more time intensive, and to some extent, Jamovi to get your data ready.  This is your basic primer on the subject.

To prepare your data you will usually follow the following steps:

Throughout the process document what you do and why you do it. This allows others to replicate and check your analysis!

2.1 Dataset Structure

2.1.1 Explore Dataset Structures

If you're familiar with Excel, you will recognise the standard structure of datasets: organised into rows and columns. While datasets vary in appearance, this basic format is consistent. 

Columns: Each column represents a variable, which can be any characteristic, number, or quantity that is measurable. Variables include attributes like an individual's age, sex, income, ethnicity, or views on a topic.

Rows: Each row corresponds to a single observation, which may include multiple fields — for instance, a participant's complete survey responses or comprehensive details of an incident. These rows are comprised of numerous variables, each represented by a column.

It is crucial to grasp this row-and-column structure, as it is a constant across various software like Excel, Jamovi, SPSS, R, and Python. Usually, your dataset will come as a .csv file. However, there are other data structures, such as .json files, but these need to be transformed into the row-and-column format before analysing the data.

The above image gives you a good idea of what your data should look, prior to you analysing your data. 

Familarise yourself with this structure - you will not be able to get away from it. 

2.2 Data Preparation

2.2.1 Explore your data

Understanding and exploring your dataset before diving into data preparation is a critical step in any data analysis project. It serves as a foundational stage where you familiarize yourself with the nature, structure, and idiosyncrasies of your data. This preliminary step can inform how you approach cleaning, transforming, and analyzing your data.

Familiarize yourself with the Datas

Review Metadata:

Start by examining any available metadata, which includes information about the data source, collection methods, data dictionary (descriptions of variables), and any previous manipulations the data might have undergone. Often this can be found in a codebook. If you have collected the data yourself, consider creating a codebook to record this data

Initial Observations:

Open your dataset in a data analysis tool and do a cursory review to understand the basic format, the number of variables (columns), and the number of observations (rows). Check for:

Descriptive Statistics

Generate summary statistics for numerical data to get a sense of the central tendency (mean, median), dispersion (range, interquartile range, standard deviation), and shape (skewness, kurtosis).

For categorical variables, look at frequency distributions to understand the proportion of each category and to spot any categories that may have very few observations.

Data Quality Assessment

Identify if the dataset has missing values, where they are, and how they're represented (e.g., as NaN, NULL, blank spaces, or placeholders like -999).

Search for and assess the impact of duplicate rows or entries. Decide on a strategy for handling any duplicates you find.

Look for outliers or anomalies in your data, which could be errors or legitimate extreme values.

At this state you might also consider visualising some of your data, to help you get a better understanding of what the data is hidding. Use visual tools like histograms, box plots, and scatter plots to observe the distribution and relationships between variables.

Initial Findings and Tidy Data:

Note down any interesting patterns, trends, or initial insights that could guide more in-depth analysis later.

Ensure that your data is tidy; each variable should form a column, each observation should form a row, and each type of observational unit should form a table.

Identify areas that will require cleaning or transformation based on your exploration.

By thoroughly understanding and exploring your dataset upfront, you'll set a strong foundation for the subsequent steps in your data analysis process. This phase ensures that the data you work with is well-suited for your analysis objectives and that the insights you derive are reliable and meaningful.

2.2.2  Remove unnecessary rows and columns & Rename Variables

Data preparation is a critical step in the data analysis process, involving cleaning and transforming raw data into a format suitable for analysis. Part of this process includes removing unnecessary columns and rows, as well as renaming variables. Here's an overview of these tasks:

Removing Unnecessary Columns and Rows

Raw datasets will often have variables that are simple of no use.  When you conduct a survey on a platform like Qualtrics or Surveymonkey, the raw data contains additional variables such as progress, the persons IP address etc. There may also be questions about ethics, that will add nothing to your dataset and can be deleted.

Delete incomplete observations:

It is highly likley that you have some incomplete observations. You may want to delete some of these incomplete observations. You will need to excercise some judgment when deciding what incomplete observations to delete. 

In a survey, if a participant has only complete 50-60% maybe even 70% of the survey consider deleting such observations - but again it depends what data is missing. If the questions they have answered are inconsequanicial, the consider keeping the observation. 

If you download your raw data from Qualtrics, JISC survey or Surveymonkey, they will usually have some sort of progress variable that tells you how much of the survey has been completed. You can use the filter function in Excel and filter out all of the observations you want to keep. Then delete those remaining. 

Columns: Removing unnecessary columns simplifies the dataset by eliminating irrelevant, redundant, or unimportant data. This is crucial because it makes the dataset easier to work with and understand.

Rows: Similarly, removing unnecessary rows, or "observations", can be important for several reasons:

Renaming Variables

Renaming variables in a dataset is an essential step that enhances clarity and readability, which in turn facilitates better communication of analysis results. Renaming helps to: 

Often you will need to rename your variables. Often the raw dataset will include the question you asked. This is not a suitable variable name. Variable names should be short (but still make sense). 

Depending how on the size of your dataset, this may take quite some time. You should also add the questions into your codebook, as an easy reference.

Consider the following?

When performing these tasks, consider the following best practices:

Documentation: Always keep a record of all changes you make to the original dataset, including which columns or rows were removed and any variable names that were changed. Your documentation should clearly communicate the rationale behind removing certain data and renaming variables so that everyone who uses the datasets knows what to expect. 

Consistency: Apply a consistent logic when deciding which columns and rows to remove and when renaming variables. This consistency should be aligned with the goals of your analysis.

Validation: After removal and renaming, validate the dataset to ensure that no critical information is lost and that the dataset still accurately represents the underlying phenomena.

Backup: Always keep a copy of the original data before making any changes. This allows you to revert back in case of any errors during the data preparation process.

By carefully considering the need to remove and rename data, you can significantly improve the quality and utility of your data, making subsequent analyses more robust and reliable.

2.2.3 Adjust your Variables and Levels

For Continuous Variables:

Data Entry Errors: Check for errors in our data. Look out for impossible values. Some survey programs for example automatically record missing values as 99 or -99 or you might have a surevey for participants 18+ but have values below that. Such values need to be removed and/or corrected. 

Consistency Checks: Make sure that the data across the related variables makes sense (e.g., age, income, etc) and is conistent. 

Standardise Formats:  Make sure your continuous data follows a standard format (e.g., number of decimal places, etc).

For Ordinal Variables:

Make sure that the values are accurate. Convert the levels of ordinal variables into numerical codes, as this is necessary for certain statistical tests that require numerical input. Some programs (such as Jamovi or R) allow you to attach labels to your numbers. For example: Strongly agree = 0 

If you do not encode your levels, some tests may not function correctly. Even when tests run without errors, you must still inform the statistical program about the order of the levels.

For Nominal Data:

You may also choose to convert nominal data labels into numbers based on personal preference. This step is not mandatory, but doing so can make the data easier to handle. Ensure that the text for your labels is accurate. For example, use 'Other' instead of 'Other:'. To convert the labels into numbers, you could use:

Note:  Encoding does not turn your nominal variable into a continuous or ordinal variable.

After performing the above steps, your dataset should be relatively clean. In Jamovi, also verify that the correct level of measurement is selected for each variable.

2.2.4 Recode a Variable

Recoding involves converting variables from one format or structure into another. It is typically used with categorical data. We recode variables as some statistical tests and programs can only handle numerical data, so categorical data need to be recoded into a numerical format. There are several methods. 

2.2.5 Deal with Missing Values 

When analysing your data, you MUST instruct the statistical programme to recognise missing data. Otherwise, you risk obtaining wrong results. Qualtrics, for instance, designates missing data with a '-99' code. If you neglect to inform your statistical programme that '-99' signifies missing data, the computer will erroneously incorporate these values into its calculations. Other programmes such as Jamovi or Python represent missing data with NaN.

Ensure you adopt a method appropriate and consistent with the statistical programme you're utilising. For example, Jamovi identifies blank cells, 'na', or 'NaN' as missing data, requiring no additional action on your part. However, if you're dataset uses '-99', you will need to manually tell Jamovi to recognise this as the missing value code.

There's no need to eliminate all missing data, as most statistical programmes will exclude it from analysis (this will affect your sample size). Excessive missing data may trigger an error, indicating that analysis is unfeasible. Adding in missing data will increase the available data for your analysis. 

Working with Missing Data:

It's inevitable that you will encounter missing data, even after removing incomplete observations. There are two primary approaches to handle missing data:

After choosing your method, decide which data (rows and columns) to discard, then proceed to impute the missing values using your chosen method.

For a more indepth dicussion on dealing with missing values visit this website.

2.2.6 Transform Variables

Transforming variables is a common operation in data analysis. There are a few common types of transformations:

The specific method to use depends on the nature of your data and the goals of your analysis. We will keep it simple for now and only cover recoding and binning. 

Recoding is covered above. Below we also looked at One-Hot Encoding and Bucketing/Binning in more details.

One-Hot Encoding: 

One-Hot Encoding converts each category level into a new binary variable (0 or 1). Each level is represented by a dummy variable.

How does One-Hot-Encoding work: 

One-hot encoding is a process used to convert categorical data into a numerical format by creating a binary column for each category of the variable. In one-hot encoding, each category becomes a new column, and a binary value of 1 or 0 is assigned to those columns in each row of the dataset.

Below is a step-by-step summary of how one-hot encoding works:

The result is a "one-hot" because only one category column is hot (1) for each row, while all others are cold (0). This technique is widely used to prepare categorical data for many types of machine learning algorithms.


Imagine you have a laundry basket full of socks of different colours. Now, you want to organize these socks so that it's easier for you to find a pair when you need it. So you sort the socks into different bins or buckets based on their colour. All the red socks go into one bin, blue into another, and so on. This is what binning or bucketing is like in data analysis.

Now, let's relate this to data:

Bucketing or binning is like sorting socks. But instead of socks, you have a lot of numbers (data points). These could be anything: ages of people, prices of houses, temperatures, etc. When there are too many different numbers, it's hard to make sense of them. So, you put these numbers into groups (or bins) to organize them better.

Why Use Bucketing/Binning?

Simplicity: It makes your data simpler. Instead of looking at every single age in a survey, you look at age groups like 0-18, 19-35, etc.

Manageability: It makes large data sets more manageable. Imagine trying to understand the individual ages of thousands of people at once!

Analysis: It helps in analysis. It's easier to see patterns and trends when you have groups to compare, like seeing which age group buys more online.

How Does it Work

That's bucketing/binning in a nutshell – organizing your data into groups to make it easier to understand and work with.

Simple transformations can easily be done. However, often you may want to start combining variables to reduce them into one variable.

Of course, this is something that can be done. However, don't just go and add variables together. There are statistical tests, such as a Reliability Analysis, a Principle Component Analysis, or an Exploratory Factor Analysis, covered later that should be used to combine variables. These tests will show you whether the variables you want to combine are statistically correlated or not. This topic will be covered in more depth later.

When transforming a variable, ensure that each new category is exclusive. There should be no overlap between the categories. 

E.g., In the above example the 'under 65' category should only include values up to, but not including 65, so any values from 0-64. The over 65 variable should include all values above 65. If you overlap them, you will be counting some data points twice. 

2.3. Suggested Reading: 

2.4 Additional Reading: Quantitative Research Design

In this tutorial, you will learn about understanding the data that has already been collected. If you conduct your own research, it is really important to understand these concepts and think about them in your research design phase. The readings below provide you with a good overview of Quantitative Research Design.

While research design is extremely important, it is not covered here, as the focus is on learning basic data analysis skills.