# Descriptive Statistics

## Overview & Learning Outcomes

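In this session you will learn about:

- Measures of Central Tendency (the mean, median and mode)
- Measures of Dispersion (minimum, maximum and range, quartiles and the Inter Quartile Range, and the standard deviation)
- Standardising variables using z-scores
- Which graphs to use to display descriptive data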

These are basic methods that will help you describe your data. The sub-pages will show you, step by step, how to get descriptive statistics using Jamovi.

Once you have worked through the concepts and the step-by-step guides, try to complete the tasks.

## Dataset used for the examples

Skoczylis, Joshua, 2021, "Extremism, Life Experiences and the Internet", https://doi.org/10.7910/DVN/ICTI8T, Harvard Dataverse, Version 3.

## Concept: Measures of Central Tendency

A more concise video is available here.

## Measures of Central Tendency

Measures of Central Tendency provide details that will help you describe the centre of your data in a set of single values. You are probably familiar with all of them.

Each of them is calculated differently, but they all try to describe the 'middle' or 'centre' of your distribution.

As you can see from the image below, if the Mean, Median and Mode are all the same, your data is symmetrically distributed. This means that 50% of all values are below the measures of central tendency, and 50% are above.

Most of the time your distribution will, however, be skewed.

If your data has a positive skew, the median will typically sit to the right of the mode, and the mean will typically sit to the right of the median.

If your data has a negative skew, this is reversed, with everything shifting towards the left.

Let's explore each of the three Measures of Central Tendency in some detail. As you continue to work through these tutorials you will come across the mean and median frequently.

This is the only time we will show you how they are calculated, as these are measures that you will come across again and again. But don't worry, there will be no other maths involved, we promise!

A quick note on presentation: usually you will either mention the statistic(s) in the text or display a table (which you can generate with ease).

## The Mean

To get the mean, you add up the values of all your data points and then divide this total by the number of data points. Don't worry, the computer does these calculations for you, but it is good to know how it is calculated.
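As a minimal sketch of the calculation described above, the mean can be worked out by hand and then checked against Python's built-in `statistics` module. The ages are made-up values for illustration only; they are not taken from the Extremism dataset.

```python
from statistics import mean

# Hypothetical ages, for illustration only (not from the Extremism dataset).
ages = [23, 31, 35, 42, 66]

# Add up all the values, then divide by the number of data points.
manual_mean = sum(ages) / len(ages)

# The standard library performs the same calculation.
library_mean = mean(ages)

print(manual_mean, library_mean)
```

Both approaches give the same result, which is all your statistics programme is doing behind the scenes.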

One of the disadvantages of the mean is that outliers can skew the result, shifting the centre of gravity.

Note: Remember data types? The mean should only be used for continuous data. Getting a mean for ordinal or nominal data does not make sense, so keep this in mind when you are presenting your findings.

Let's display some simple results using the Age and Extremism Score variables from the Extremism Dataset.

The table on the top left shows you the mean age and extremism score of all the participants. The table on the bottom splits this by gender. Notice how the means change. The mean extremism score for women is noticeably lower than the combined mean.

In the table on the right, you see the same data, but this also includes information on how many participants there were (N), and how much data is missing.

Statistical software packages allow you to generate such tables with ease.
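To make the idea of such a table concrete, here is a rough sketch of what the software does when it reports N, missing values, an overall mean and a mean split by group. The records, the field names (`gender`, `score`) and the values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"gender": "Female", "score": 20},
    {"gender": "Female", "score": 30},
    {"gender": "Male", "score": 40},
    {"gender": "Male", "score": None},
    {"gender": "Male", "score": 50},
]

# Keep only the valid (non-missing) scores.
valid_scores = [r["score"] for r in records if r["score"] is not None]
n_valid = len(valid_scores)
n_missing = len(records) - n_valid
overall_mean = mean(valid_scores)

# Split the valid scores by gender and take the mean of each group.
scores_by_gender = defaultdict(list)
for r in records:
    if r["score"] is not None:
        scores_by_gender[r["gender"]].append(r["score"])
group_means = {gender: mean(scores) for gender, scores in scores_by_gender.items()}

print(n_valid, n_missing, overall_mean, group_means)
```

Notice that the group means differ from the overall mean, just as they do in the tables described above.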

## The Median

To get the median, your data is sorted in ascending or descending order. The median is the middle value: it literally divides your data into two equal halves. As noted, the mean can be affected by outliers, so sometimes the median can be a better description of the centre of your data.

As you can see in the image below, if your data has an even number of observations, the mean of the two middle values becomes the median.

If your data has an odd number of observations, the middle value becomes the median.
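The two cases above, and the effect of an outlier on the mean compared with the median, can be sketched in a few lines of Python. The numbers are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical values, for illustration only.
odd_sample = [3, 1, 4, 1, 5]       # sorted: 1, 1, 3, 4, 5
even_sample = [3, 1, 4, 1, 5, 9]   # sorted: 1, 1, 3, 4, 5, 9

# Odd number of observations: the middle value is the median.
median_odd = median(odd_sample)    # the middle value, 3

# Even number of observations: the mean of the two middle values (3 and 4).
median_even = median(even_sample)  # 3.5

# Adding one extreme value pulls the mean a long way, but barely moves the median.
with_outlier = odd_sample + [100]
mean_before, mean_after = mean(odd_sample), mean(with_outlier)
median_before, median_after = median(odd_sample), median(with_outlier)

print(median_odd, median_even)
print(mean_before, mean_after, median_before, median_after)
```

This is why the median is often the better summary when your data contains outliers.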

Note: Again, you would only really use this statistic with continuous variables.

The median has now been added to the table for Age and Extremism Score.

From the table below we can see that the median age is higher than the mean, which suggests this variable has a negative skew.

The opposite is the case for the extremism variable, where the mean is higher than the median, which suggests the data has a positive skew. A histogram of both variables also shows you their distribution and confirms the skew.

## The Mode

Line up your data in ascending or descending order. The value that appears most frequently is the Mode. If two values are tied as the most frequent, you have a bimodal distribution.

Note: Unlike the Mean or Median, you can get the mode for ordinal and nominal variables - as it tells you the most frequent value.
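As a small sketch, Python's `statistics` module can find the mode for categorical data as well as for numbers. The income bands and scores are hypothetical examples, not values from the dataset.

```python
from statistics import mode, multimode

# Hypothetical ordinal categories, for illustration only.
income_bands = ["low", "middle", "middle", "high", "middle", "low"]

# The mode works for categories because it only counts how often each value occurs.
most_common_band = mode(income_bands)

# Hypothetical scores where two values are tied as the most frequent (bimodal).
scores = [1, 2, 2, 3, 3, 4]
all_modes = multimode(scores)

print(most_common_band, all_modes)
```

`multimode` returns every value that is tied for most frequent, so a list with two entries indicates a bimodal distribution.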

In the table below you can now see all three Measures of Central Tendency. The Mode confirms the skew for both variables.

The table below shows you the mode for the Household Income variable. Household income is an ordinal variable, so the numbers represent categories as highlighted.

## Concepts: Measures of Dispersion

Despite their importance, Measures of Central Tendency tell us very little about the distribution of our data. This is where Measures of Dispersion come in. The most common measures are:

- Minimum, Maximum and Range
- Quartiles and the Inter Quartile Range (IQR)
- The Standard Deviation

We will also introduce you to the concept of standardisation using z-scores.

Note: Again, you should only apply these statistics to continuous data. Getting them for nominal or ordinal data will give you results, but these are meaningless.

Many statistical tests are based on something called the normal distribution. These tests are called parametric tests. The normal distribution follows a bell-shaped curve that describes the distribution of continuous variables. Much of the data you will collect within the sciences and social sciences will approximately resemble a normal distribution. In a normal distribution, the mean, median and mode all sit at the peak.

Below is a short video on z-scores that provides a really good explanation.

A more concise video is available here.

## Minimum, Maximum & Range

To some degree, the Minimum, Maximum and Range are self-explanatory.

The Minimum value is the lowest value in your variable. In the example below the lowest value is 23.

The Maximum value is the highest value in your variable. In the example below, the highest value is 66.

The Range is the difference between the Maximum and Minimum. In the example below it is 43 (66 - 23). The range gives you a good indication of how widely your data is spread.
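A minimal sketch of these three values in Python is shown here. The list of ages is hypothetical; only its lowest and highest values (23 and 66) are chosen to match the minimum and maximum quoted above.

```python
# Hypothetical ages; only the lowest (23) and highest (66) match the example.
ages = [23, 30, 41, 52, 66]

minimum = min(ages)
maximum = max(ages)

# The range is simply the maximum minus the minimum.
value_range = maximum - minimum

print(minimum, maximum, value_range)
```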

These values all take on the same unit of measurement as your actual variable, so if you are measuring age, these will be the age of an individual, and if it is height, it will be the height of an individual.

## Quartiles & Inter Quartile Range

Quartiles split your data into 4 equal parts. Generally, the quartiles are arranged from the lowest to the highest.

The Inter Quartile Range (IQR) represents the middle 50% of your data and is the difference between Q3 and Q1 (Q3 - Q1).

One of the many graphs you will come across is the boxplot - this is a representation of the quartiles and the IQR.

Note: Each quartile represents 25 per cent of the data. Once these are displayed, the distances for each quartile may vary as they represent data rather than percentages.

You can, of course, split your data into more than just 4 equal parts. Most statistics programmes make this very easy.
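The sketch below shows one way to get quartiles, the IQR and a finer split using Python's `statistics.quantiles`. The data is hypothetical. Be aware that statistics packages use slightly different rules for interpolating quartiles, so the values Jamovi reports for the same data may differ a little from these.

```python
from statistics import quantiles

# Hypothetical values, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 returns the three cut points that split the data into 4 equal parts.
q1, q2, q3 = quantiles(values, n=4)

# The IQR covers the middle 50% of the data.
iqr = q3 - q1

# The same function can split the data into more parts, e.g. 10 (deciles).
deciles = quantiles(values, n=10)

print(q1, q2, q3, iqr)
print(len(deciles))  # 9 cut points create 10 parts
```

Here `q2` is the median, which is why the median appears as the line inside the box of a boxplot.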

## The Standard Deviation

The standard deviation tells us about the average variability in your dataset: roughly, how far values are from the mean on average. In a normal distribution, most values cluster around the central region, with fewer and fewer values at the edges.

Note: The standard deviation is expressed in the same units as your variable. So if your variable is age measured in years, the standard deviation is also in years.

The graph below demonstrates how much of your data is covered by standard deviations in a normal distribution. For example, ±1 standard deviation from the mean covers just over 68% of your data, while ±2 standard deviations from the mean covers about 95% of the data.
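For the curious, here is a rough sketch of how a standard deviation is calculated, using made-up values. It computes the sample standard deviation (dividing by n - 1), which is what most statistics packages report by default; the population version (dividing by n) is shown for comparison.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: find the mean.
m = mean(values)

# Step 2: square each value's distance from the mean.
squared_deviations = [(x - m) ** 2 for x in values]

# Step 3: average the squared distances (dividing by n - 1) and take the square root.
sample_sd = sqrt(sum(squared_deviations) / (len(values) - 1))

# The standard library gives the same sample SD, plus the population SD.
library_sample_sd = stdev(values)
population_sd = pstdev(values)

print(sample_sd, library_sample_sd, population_sd)
```

Because the distances are squared and then square-rooted, the result ends up in the same units as the original variable.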
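These coverage figures can be checked with a short sketch using `statistics.NormalDist` from Python's standard library. The helper name `coverage` is our own, introduced only for this illustration.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one_sd = coverage(1)
within_two_sd = coverage(2)

print(f"{within_one_sd:.1%} {within_two_sd:.1%}")
```

These shares only hold for data that is normally distributed; a skewed variable will not follow them exactly.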

It is also worth mentioning Skewness and Kurtosis. Skewness tells you how asymmetric your distribution is, and kurtosis tells you how heavy its tails are. Most stats programmes provide you with these values; however, you will rarely use them, so they are not covered here.

But if you do want to know more, read on here.

## Standardising variables (z-scores)

Variables can be standardised by converting the values of your variables into z-scores, where the mean is 0 and the standard deviation is 1. Negative numbers indicate that the value is below the mean, while positive numbers indicate that the value is above the mean.

When you convert or standardise a variable, the mean becomes 0 and the values below and above the mean become negative/positive standard deviations. Doing so allows you to easily calculate the probability of a certain value occurring in your distribution. It also allows you to compare datasets with different means and standard deviations.

Z-Scores tell you how many standard deviations your values are below/above the mean.
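In other words, a z-score is the value minus the mean, divided by the standard deviation. The sketch below standardises a small set of made-up marks; it uses the sample standard deviation, which is an assumption about how your software standardises.

```python
from statistics import mean, stdev

# Hypothetical marks, for illustration only.
marks = [40, 50, 60, 70, 80]

m = mean(marks)
sd = stdev(marks)  # sample standard deviation (assumed here)

# z = (value - mean) / standard deviation
z_scores = [(x - m) / sd for x in marks]

print([round(z, 2) for z in z_scores])
```

After standardising, the z-scores themselves have a mean of 0 and a standard deviation of 1, and a mark equal to the mean has a z-score of exactly 0.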

Remember the normal distribution above? We know, for example, that ±1 standard deviation from the mean covers about 68.3% of all values.

Example: Let's say we want to find out whether an assessment in 2021 was harder than the one in 2020. Below is some descriptive information about both assessments.

We can easily see that the marks are spread much wider in the second assessment, which has a range of 55 and a bigger Standard Deviation (16.4).

Using z-scores we can compare how likely it is to get, let's say, a mark of 45% or lower in each assessment.

Remember, converting the values is done in the Statistics programme of your choice (you can easily transform a variable into z-scores).
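To illustrate the logic of this comparison, here is a sketch that turns a mark of 45 into a z-score for each year and then looks up the probability of that mark or lower under a normal distribution. Only the 2021 standard deviation (16.4) comes from the text; both means and the 2020 standard deviation are hypothetical placeholders, so the resulting probabilities are illustrative and not the real results.

```python
from statistics import NormalDist

mark = 45

# ASSUMPTION: the means and the 2020 SD are made-up placeholders.
# Only the 2021 SD of 16.4 is taken from the text.
mean_2020, sd_2020 = 62.0, 10.0
mean_2021, sd_2021 = 58.0, 16.4

# Convert the mark into a z-score for each assessment.
z_2020 = (mark - mean_2020) / sd_2020
z_2021 = (mark - mean_2021) / sd_2021

# Probability of a z-score this low or lower in a standard normal distribution.
standard_normal = NormalDist()
p_2020 = standard_normal.cdf(z_2020)
p_2021 = standard_normal.cdf(z_2021)

print(f"2020: z = {z_2020:.2f}, P(mark <= 45) = {p_2020:.1%}")
print(f"2021: z = {z_2021:.2f}, P(mark <= 45) = {p_2021:.1%}")
```

Because both marks are expressed in standard deviations, the two assessments can be compared directly even though their means and spreads differ.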

Looking at these numbers we can surmise that students found it harder to achieve a higher mark in 2021. Let's display the results below using the normal distribution. The orange area indicates the likelihood of getting a mark of 45% or below.

Graph: Generated using the distrACTION module in Jamovi.

Note: Whether the difference between the two variables is statistically significant has to be determined by a test, which is covered in Tutorial 4.

Go to this website for a more detailed description about z-scores.

## Which graphs do I use?

By default, most stats programmes will provide you with a table that displays the relevant information. Sometimes, however, you may want to display your data visually. Below is a brief overview of the most common plots you can use to display descriptive data.

Note: your data type is really important. Using the wrong type of data will still give you a graph, but it won't make much sense.

## Single Variable

Continuous Data Plots:

Histograms:

A histogram is similar in appearance to a bar chart. A histogram condenses your data into a series of logical ranges (bars). Each bar shows you how many data points fall within that range.
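The counting behind a histogram can be sketched without any plotting library. The ages and the 10-year bin width below are hypothetical choices made for illustration; your statistics programme will normally pick the ranges for you.

```python
from collections import Counter

# Hypothetical ages, for illustration only.
ages = [23, 25, 31, 34, 38, 41, 45, 52, 58, 66]
bin_width = 10

# Assign each age to the start of its 10-year range (e.g. 34 -> 30).
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Print a simple text histogram: one row per range, one '#' per data point.
for start in sorted(bin_counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * bin_counts[start]}")
```

Each row corresponds to one bar, and the number of `#` symbols corresponds to the bar's height.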

Density Plot:

This plot is similar to a histogram, but instead of having separate bars, you get a smooth line to represent the distribution of your data.

Both these plots are great at visually representing the distribution of your data.

An excellent explanation about what histograms are and when to use them can be found here.

Continuous Data Plots:

Boxplots:

A boxplot is another way of displaying the distribution of your data. This type of graph displays your Minimum, Maximum, the IQR (the box), your Median (line through the box) and the mean (black square).

You can overlay the individual data points on the boxplot, or leave them out.

Categorical Data Plots:

Bar Charts:

These types of graphs are a great way of displaying categorical data. Each bar represents a category, and its length is proportional to the value it represents. Bars can be plotted vertically or horizontally.

You can plot them as grouped or stacked bar charts.

Choose between counts (the actual number of cases) or percentages. Note: using percentages is usually preferred.
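The difference between counts and percentages can be sketched as follows. The response categories are hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical categorical responses, for illustration only.
responses = ["Agree", "Disagree", "Agree", "Neutral",
             "Agree", "Disagree", "Agree", "Neutral"]

# Counts: the actual number of cases in each category.
counts = Counter(responses)

# Percentages: each count as a share of all cases.
total = sum(counts.values())
percentages = {category: 100 * n / total for category, n in counts.items()}

print(dict(counts))
print(percentages)
```

Percentages make it easier to compare groups of different sizes, which is one reason they are often preferred on bar charts.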

## Multiple Variables

Continuous Data:

Scatterplot:

These types of plots show you the relationship between two continuous variables. This relationship can be displayed by a line (usually linear - but it can also be non-linear). A scatterplot uses dots to represent your data. Each dot marks the point where one case's values on the two variables meet.

Note: these are just plots. Without an accompanying statistical test, you should not infer that a relationship is statistically significant.

The example below shows you the relationship between age and the extremism score. As age goes up, the extremism score goes down. The scatterplot also displays the density plot for each variable on the side.

Categorical Data:

Stacked/Grouped Bar Charts

These are the same as normal bar charts, except that you split the data by another categorical variable.

In the plots below you see a grouped and a stacked barplot. The examples show you how the views on violence are broken up by gender.

Continuous v Categorical Data:

Here you would just use a continuous plot, but add a grouping/split variable. You will then get two graphs that you can compare.

In the example below we use a boxplot to compare the racism score between men and women.