Tutorial 1: Understand Data

Tutorial Overview:

One of the most important things in Statistics and Data Science is understanding your data. This is so fundamental that you shouldn't move on to more advanced tutorials until you have fully understood the ideas and concepts outlined here. While you might still get results from an analysis, those results won't make much sense unless you have used the correct type of variables.

In this tutorial, we will cover variables and levels of measurement. 

Throughout this, and all other tutorials, we will be using the raw dataset on Politics, Extremism, Life and the internet. Once you have worked through this material you will be able to apply it to your own datasets. 

1.1 Understanding your data

1.1.1 What are Variables

In the simplest terms, a variable in statistics refers to something that can change or vary. It's akin to a characteristic or feature that may differ across various situations or among different individuals.

A variable is a record of any number, quantity or characteristic that can be measured. Examples are Age, Gender, or Attitudes, to name but a few. The variable Age, for example, would be a record of the age of every person you spoke to as part of your research. A variable can hold as many values as you have observations.

When conducting research, for instance, you might be interested in people's attitudes towards the current government. During data collection, you might gather information about participants' ages, where they live, whom they voted for in the previous election, and their views on the current government. Each one of these pieces of information represents a variable (e.g., Age, Place of Residence, Voting History, etc.). For each participant, the answers to these questions will vary. 

Thus, a variable is essentially a collection of the different outcomes or responses to your question, characteristic, or feature. 
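As a rough illustration of the idea that a variable collects every participant's answer to one question, here is a minimal Python sketch. The participants, their answers, and the helper name get_variable are all made up for this example and do not come from the tutorial dataset.

```python
# Hypothetical survey responses: each dict is one participant (one row).
responses = [
    {"age": 34, "residence": "Leeds", "voted_for": "Party A"},
    {"age": 51, "residence": "Cardiff", "voted_for": "Party B"},
    {"age": 27, "residence": "Leeds", "voted_for": "Party A"},
]


def get_variable(rows, name):
    """Collect every participant's value for one variable (one column)."""
    return [row[name] for row in rows]


ages = get_variable(responses, "age")
voted_for = get_variable(responses, "voted_for")

print(ages)
print(voted_for)
```

Each list holds one value per participant, which is why the values differ from row to row.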

There are different types of variables. We call these levels of measurement. These will be discussed below.

1.1.2 Levels of Measurement

Levels of measurement describe the nature of the information contained in the values assigned to a variable. They help us classify variables into types of data.

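These levels are nominal, ordinal, and continuous. These can be split into two main categories: qualitative (categorical) data, which covers the nominal and ordinal levels, and quantitative data, which covers the continuous level.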

Here, "qualitative" does not mean engaging in qualitative research; rather, it refers to qualitative data that is being quantified (e.g. by counting or summing it). In the social sciences, much of the data you will use will be qualitative.

The importance of understanding and applying the different levels of measurement correctly cannot be overstated - your analysis will depend on selecting the correct types of data.

1.1.3 Nominal Data (often referred to as categorical data)

This data type refers to data that can be divided into separate categories distinguished by names or labels, where the order or ranking of these categories is not meaningful or relevant. Key characteristics of nominal data include:

Understanding the nature of nominal data is crucial, as it dictates the type of statistical tests that can be appropriately applied and influences how conclusions are drawn from the data.

Avoid treating variables where numbers are used to represent categories (e.g., Male = 0, Female = 1, Other = 2, etc.) as continuous variables. They are not continuous, and any results you get from treating them that way will be meaningless. Always check your codebook to see what each variable represents.
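To see why numeric codes for categories should not be treated as continuous, consider the following minimal Python sketch. The codebook mapping follows the example coding above, but the responses are invented and the names GENDER_CODEBOOK and gender_codes are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical codebook entry: the numbers are labels, not quantities.
GENDER_CODEBOOK = {0: "Male", 1: "Female", 2: "Other"}

# Made-up coded responses from six participants.
gender_codes = [0, 1, 1, 2, 0, 1]

# Python will happily compute this, but the number describes no category.
misleading_mean = mean(gender_codes)

# Decode with the codebook first, then count how often each category occurs.
gender_labels = [GENDER_CODEBOOK[code] for code in gender_codes]
gender_counts = Counter(gender_labels)

print(misleading_mean)
print(gender_counts)
```

The mean comes out at roughly 0.83, which corresponds to no category at all, whereas the counts per label are directly interpretable.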

1.1.4 Visualising and Analysing Nominal Data:

This type of data is typically used for grouping or classifying information. Statistical analysis of nominal data includes counting frequencies, finding the mode, or using chi-square tests. The mean and median are not applicable because the categories have no numerical value and no order.
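The sketch below shows, under stated assumptions, how the mode and a chi-square statistic can be computed for nominal data using only the Python standard library. The two variables and their values are invented. The statistic is the plain Pearson chi-square for a contingency table, without a continuity correction and without a p-value, so treat it as an illustration of the arithmetic rather than a complete test.

```python
from collections import Counter
from statistics import mode

# Made-up nominal answers from the same eight participants.
residence = ["Urban", "Urban", "Urban", "Urban", "Urban", "Rural", "Rural", "Rural"]
voted = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No"]

# The mode is simply the most frequent category.
most_common_residence = mode(residence)

# Observed counts for each combination of categories (the contingency table).
observed = Counter(zip(residence, voted))
row_totals = Counter(residence)
col_totals = Counter(voted)
n = len(residence)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / n.
chi_square = 0.0
for r in row_totals:
    for c in col_totals:
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (observed[(r, c)] - expected) ** 2 / expected

print(most_common_residence)
print(round(chi_square, 3))
```

With a sample this small, the expected counts are too low for a real chi-square test to be trustworthy; in practice you would use a larger sample and a statistics package to obtain the p-value.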

For visual representation, nominal data is commonly presented through bar charts or pie charts, which show the distribution of frequencies across different categories.

These examples visualise categorical data using bar charts. The examples show you the difference between counts and percentages - more on that in later tutorials. 
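The difference between counts and percentages can be sketched in a few lines of Python. The party names and votes are made up, and the text-based bar chart is only a stand-in for a proper plot.

```python
from collections import Counter

# Made-up nominal variable: which party each of eight participants voted for.
votes = [
    "Party A", "Party B", "Party A", "Party C",
    "Party A", "Party B", "Party A", "Party B",
]

counts = Counter(votes)
total = sum(counts.values())

# Percentages express each count as a share of all responses.
percentages = {category: 100 * count / total for category, count in counts.items()}

# A simple text bar chart, most frequent category first.
for category, count in counts.most_common():
    bar = "#" * count
    print(f"{category:<8} {bar:<5} {count:>2} ({percentages[category]:.1f}%)")
```

Counts tell you how many participants fall in each category, while percentages make groups of different sizes comparable.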

1.1.5 Ordinal Data

Ordinal data is a type of categorical data that involves ordering or ranking the values. These are some of the key features: 

Note: Data such as birth year and dates are treated as ordinal, not continuous, variables.

Visualising and Analysing Ordinal Data:

In the social sciences, the use of ordinal data is common. Ordinal outcomes are often incorporated into surveys and structured interviews where understanding the order or ranking is important, but the precise differences between these ranks are either unknown or irrelevant.

The analysis of ordinal data is somewhat limited compared to interval or ratio data. You can use measures like the median or mode to describe central tendency, but the mean is not appropriate due to the undefined intervals between ranks. Non-parametric statistical tests are often used for ordinal data.
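A minimal sketch of finding the median and mode of ordinal data follows. The five-point agreement scale and the responses are hypothetical. The ranks are used only to put the answers in order, and median_low is chosen here because it always returns an actual response rather than averaging two ranks.

```python
from statistics import median_low, mode

# Hypothetical five-point agreement scale, listed from lowest to highest.
SCALE = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Made-up responses from seven participants.
responses = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree", "Agree", "Neutral"]

# Convert labels to rank positions only so that they can be sorted.
ranks = [RANK[response] for response in responses]

# The median is the middle response once the answers are put in order.
median_label = SCALE[median_low(ranks)]

# The mode is the most frequent response.
modal_label = mode(responses)

print(median_label)
print(modal_label)
```

Taking the mean of the ranks would assume that the gap between "Disagree" and "Neutral" equals the gap between "Agree" and "Strongly agree", which ordinal data does not guarantee.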

Ordinal data is generally visualised in a similar way to nominal data, using bar charts and line graphs, which can display the ordering of the categories effectively. See below. It is also worth pointing out that you can choose to represent your data as percentages or counts (normally you should use percentages, though).

1.1.6 Continuous Data

Continuous data represents measurements and can take any value within a given range (e.g., 0-500). This type of data is often associated with scale measurements. Below are the main features of continuous data.

Continuous Data

Above, you can see a graphical representation of the relationship between two continuous variables using a scatter plot. Further, we can divide continuous data into interval and ratio data:

Interval Data

Ratio Data

Key Differences

To summarise, ratio data has a true zero point whereas interval data does not. Ratio data can support all arithmetic operations, while only subtraction and addition can be used on interval data. 
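Temperature is a common way to illustrate this distinction, and it is used here as an assumed example rather than one taken from the tutorial dataset. Celsius is an interval scale (0 is an arbitrary point), while Kelvin is a ratio scale (0 means no thermal energy). The sketch below shows that differences agree across the two scales but ratios do not.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Two made-up temperature readings in Celsius (interval scale).
cool_c, warm_c = 10.0, 20.0

# The same readings in Kelvin (ratio scale, true zero).
cool_k = celsius_to_kelvin(cool_c)
warm_k = celsius_to_kelvin(warm_c)

# Subtraction is meaningful on both scales and gives the same answer.
difference_c = warm_c - cool_c
difference_k = warm_k - cool_k

# Division is only meaningful on the ratio scale.
ratio_c = warm_c / cool_c   # 2.0, but 20 degrees C is not "twice as hot"
ratio_k = warm_k / cool_k   # about 1.035, the physically meaningful ratio

print(difference_c, round(difference_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The ratio computed in Celsius depends entirely on where the scale places its zero, which is why multiplication and division are reserved for ratio data.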

Understanding these differences is crucial in statistical analysis, as they determine the types of statistical tests that can be used and the conclusions that can be drawn from the data.

Most datasets will have an index column. This column is not a continuous variable; rather, it holds a unique value that identifies each row. Generally speaking, this variable shouldn't be used in your analysis.

1.2 Check your knowledge

1.3 Additional Resources & Reading: