Tutorial 3: Descriptive Statistics

Tutorial Overview:

In this tutorial, we will explore basic descriptive statistics. Descriptive statistics provide basic information about our variables, and this information will inform our data analysis later on. In this tutorial we have separated descriptive statistics into two main areas: Measures of Central Tendency (Averages) and Measures of Dispersion.

4.1 Measures of Central Tendency (Averages)

Measures of Central Tendency describe the centre of your data with single summary values. The measures covered here are the Mean, the Median, and the Mode, and you are probably familiar with all of them.

Each of them is calculated differently, but they all try to describe the 'middle' or 'centre' of your distribution. As you can see from the image below, if the Mean, Median and Mode are all the same, our data is perfectly symmetrically distributed. This means that 50% of all values are below the measures of central tendency, and 50% above.

Most of the time, however, your distribution will be skewed. If your data has a positive skew, the median will always shift to the right of the mode, and the mean will always shift to the right of the median. If your data has a negative skew, this is reversed, with everything shifting towards the left.

Your statistical software of choice should generate the measures of central tendency with ease.

If you want to learn about this in more detail, watch this video here.

It is also worth mentioning Skewness and Kurtosis. Skewness tells you how asymmetrical your distribution is, and kurtosis describes how heavy its tails are. Most stats programmes provide you with these values; however, you will rarely use them, so they are not covered here.

But if you do want to know more, read on here.

4.1.1 The Mean

To calculate the mean, sum the values of all your data points and then divide this total by the number of data points. While computer software performs these calculations automatically, understanding the process helps you see what is going on behind the scenes.
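
If you are curious how this looks outside of a statistics package, here is a minimal sketch in Python (the income values are made up purely for illustration):

    # Hypothetical income values, for illustration only
    incomes = [1250, 1800, 2100, 2400, 2760]

    # Sum all data points, then divide by the number of data points
    mean = sum(incomes) / len(incomes)
    print(mean)  # 2062.0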

A notable disadvantage of the mean is its susceptibility to distortion by outliers. Outliers can skew the data, shifting its 'centre of gravity'. This is something you should be aware of.

Note: You should only use the mean on continuous data (or, in rare circumstances, on ordinal data you have decided to treat as continuous). Computing the mean for ordinal or nominal data does not provide meaningful results, even if it looks useful. When presenting your findings, ensure that the mean is used appropriately in the context of the data type.

Statistical software packages allow you to generate these statistics with ease. But you do need to understand the limitations of the mean.

4.1.2 The Median

To find the median, arrange your data in either ascending or descending order. The median is the middle value, effectively dividing your dataset into two halves. Since the mean can be influenced by outliers, the median is often a more robust measure of the central tendency of your data.

As depicted in the image below, if your dataset contains an even number of observations, the median is calculated as the mean of the two central values.

Conversely, if your dataset comprises an odd number of observations, the median is simply the middle value.
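
As a small sketch, Python's built-in statistics module applies exactly these two rules (the numbers are illustrative):

    import statistics

    odd = [3, 1, 5, 2, 4]       # odd number of observations
    even = [3, 1, 5, 2, 4, 6]   # even number of observations

    print(statistics.median(odd))   # 3   - the single middle value
    print(statistics.median(even))  # 3.5 - the mean of the two central values (3 and 4)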

Note: The median is typically utilised with continuous variables.

4.1.3 The Mode

To determine the mode, arrange your data in ascending or descending order. The mode is the value that occurs with the greatest frequency within your dataset. If two values share the highest frequency, your dataset is considered bimodal, indicating two modes.

Note: If there are no identical values, the dataset will have no Mode. 

In cases where more than two values share the highest frequency, the distribution is described as multimodal. The mode provides significant insight, especially in understanding the most common occurrence or preference in a given dataset. 

It's important to note that the mode is a versatile measure of central tendency. Unlike the mean or median, which are best suited for continuous data, the mode can be applied to any data type, including ordinal and nominal categories. This is because it identifies the most frequent category or value, which doesn't necessarily require numerical computation. Therefore, when dealing with categorical data, whether it be survey responses, preference rankings, or most common characteristics, the mode is particularly useful as it conveys the most typical case or the highest frequency of occurrence.
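
Because the mode only counts frequencies, it works even on categorical data. A minimal sketch with made-up survey responses:

    import statistics

    # Hypothetical survey responses (nominal data)
    responses = ["agree", "disagree", "agree", "neutral", "agree"]
    print(statistics.mode(responses))    # agree - the most frequent category

    # A bimodal dataset: 2 and 3 share the highest frequency
    scores = [1, 2, 2, 3, 3]
    print(statistics.multimode(scores))  # [2, 3]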

4.2 Measures of Dispersion

Despite their importance, Measures of Central Tendency tell us very little about the distribution of our data. This is where Measures of Dispersion come in. The most common measures are the Minimum, Maximum, and Range; the Standard Deviation; and Quartiles with the Interquartile Range (IQR).

We will also introduce you to the concepts of Standardisation using z-scores.

Note: Again, you should only apply these statistics to continuous data. Getting them for nominal or ordinal data will give you results, but these are meaningless.

4.2.1 Minimum, Maximum, & Range

To some degree, the Minimum, Maximum, and Range are self-explanatory.

To find them, you need to order your data in ascending order. The Minimum is the lowest value in your data.

The Maximum value is the highest value in your variable.

The Range is the difference between the Maximum and Minimum. So we subtract the Minimum from the Maximum. 

These values are all expressed in the same unit as your actual variable: if you are measuring age, they will be the age of an individual; if it is height, they will be the height of an individual.
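
A quick sketch with hypothetical ages:

    ages = [23, 31, 19, 45, 27]          # hypothetical ages in years

    minimum = min(ages)                  # 19
    maximum = max(ages)                  # 45
    value_range = maximum - minimum      # 45 - 19 = 26 years
    print(minimum, maximum, value_range)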

4.2.2 The Normal Distribution

Above we covered some very basic measures of spread. To take you further, let's explore the Normal Distribution.

Most statistical tests are based on something called the Normal Distribution. These tests are called parametric tests. The normal distribution follows a bell-shaped curve that describes the distribution of continuous data. The mean, median and mode all sit at the peak of this distribution. As you move away from the middle to either side, the curve slopes downwards, showing that values further from the mean are less common.

The normal distribution has the following attributes:  

Symmetry: It's perfectly symmetrical, meaning half of the values are less than the mean and the other half are greater.

Mean, Median, Mode: In a perfectly normal distribution, the mean, median, and mode are all the same, sitting at the centre of the distribution.

Predictability: Notice that 100% of the data points are covered underneath the bell-shaped curve. This allows for predictions. For instance, in a normal distribution, we know that about 68% (or 34% on each side of the mean) of values fall within one standard deviation (a measure of spread) above and below the mean, and about 95% of all values are within two standard deviations above and below the mean.

Standard Deviations are covered below. 
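
You can verify these percentages yourself. A small sketch using the normal curve from the scipy library (assuming you have it installed):

    from scipy.stats import norm

    # Proportion of a normal distribution within 1 and 2 standard deviations of the mean
    within_1_sd = norm.cdf(1) - norm.cdf(-1)
    within_2_sd = norm.cdf(2) - norm.cdf(-2)

    print(round(within_1_sd, 3))  # 0.683 -> about 68%
    print(round(within_2_sd, 3))  # 0.954 -> about 95%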

This pattern is incredibly useful because it can help us make sense of data, predict probabilities, and make decisions based on statistical reasoning. It's like a universal language for statisticians and researchers, helping them understand and communicate data trends and variances.

Below is an image of the normal distribution and how the data is distributed. 

Be aware that distributions can be skewed. The more skewed your data becomes, the less representative it is of the normal distribution. This is something you must be conscious of later in your analysis, as the normal distribution is an assumption that must be met for many different tests. If this assumption is not satisfied, then you must select alternative tests. 

We will cover this in more detail in later tutorials. It is worth noting how the median, mean, and mode shift in skewed distributions. Understanding this will help you select the most appropriate measure of central tendency that best represents your data. 

This video provides you with a simple overview of how normal distributions work. A more detailed and advanced video can be found below.

4.2.3 The Standard Deviation

The standard deviation tells us about the average variability in your dataset: it indicates, on average, how far values are from the mean. Most values will cluster around the central region, with progressively fewer values appearing towards the edges.

Note: The standard deviation is expressed in the same unit as your variable. Therefore, if your variable is age, the standard deviation will also be measured in years.

The graph below provides the formula for calculating this, for those who are curious.

Remember, you do have to conceptually understand what the Standard Deviation is and how it works.

The graph above illustrates the proportion of variability encompassed by standard deviations. For instance, +/- 1 standard deviation from the mean encompasses just over 68% of your data, while +/- 2 standard deviations cover approximately 95% of the data.

You can observe how the standard deviation is calculated: it is the square root of the variance. While most statistical packages provide the variance, you will rarely use it directly in your analysis.
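
As a quick check, here is a sketch with illustrative income values showing that the standard deviation is the square root of the variance:

    import statistics

    incomes = [1250, 1800, 2100, 2400, 2760]  # illustrative values
    variance = statistics.variance(incomes)   # sample variance
    sd = statistics.stdev(incomes)            # sample standard deviation

    print(round(sd, 2))               # ~576.82
    print(round(variance ** 0.5, 2))  # the same value: the square root of the variance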

For a more comprehensive tutorial on standard deviation, consider watching the linked video. 

Using the example of income, we can determine where values fall in terms of standard deviations. In the given example, we observe that 97.5% of individuals in the dataset have a higher income than Person 1. At the other extreme, we note that 94.9% of individuals have a lower income than Person 10. 

Statistical software programs can automatically generate these probabilities. For instance, in Jamovi, one can use the distrACTION module. Calculating how many standard deviations a value is from the mean is straightforward, involving only basic arithmetic (a subtraction and a division).

To calculate this value, follow these steps:

(Value of Interest - Mean) / Standard Deviation = Value's Number of Standard Deviations

Let's apply this to Person 1 and Person 10. 

Person 1: 

(£1250 - £2054.50) / £432.01

= -£804.5 / £432.01 = -1.86

Person 10: 

(£2760 - £2054.50) / £432.01

= £705.5 / £432.01 = 1.63
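
The same arithmetic, as a sketch in Python (using the mean and standard deviation from the example above):

    mean = 2054.50
    sd = 432.01

    z_person_1 = (1250 - mean) / sd
    z_person_10 = (2760 - mean) / sd

    print(round(z_person_1, 2))   # -1.86
    print(round(z_person_10, 2))  # 1.63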

This tells us that Person 1's income is 1.86 standard deviations below the mean, placing them in the bottom 2.5% of the income distribution (this is inferred by considering the normal distribution curve mentioned earlier). 

Conversely, Person 10's income is 1.63 standard deviations above the mean, which places them above the median income level.

A more concise video is available here.

4.2.4 Standardising Variables (z-scores)

Variables can be standardised by converting their values into z-scores, which represent the number of standard deviations a value is away from the mean. Remember, a standard deviation is expressed in the same unit as the underlying data. When using z-scores, the mean is always 0. A negative z-score indicates that the value is below the mean, whereas a positive z-score indicates it is above the mean.

When you standardise a variable, the mean of the transformed data becomes 0, and the values below and above the mean correspond to negative and positive z-scores, respectively. This transformation facilitates the calculation of the probability of a value within your distribution. Furthermore, it enables the comparison between datasets that have different means and standard deviations.

Z-scores indicate how many standard deviations a value lies from the mean. As previously mentioned, one standard deviation above or below the mean (±1) contains approximately 68.2% of the data points in a normal distribution.

Referring to the earlier example of people's income, Person 1 had a z-score of -1.86, and Person 10 had a z-score of 1.63. Using a z-score table, we can determine the percentage of the population that falls below (or above, if required) a certain value. Statistical software typically provides this information, but it can also be obtained from such a table.

According to the z-score table, we find that approximately 3.2% of the population has an income less than or equal to that of Person 1. Conversely, about 94.8% of the population has an income less than or equal to that of Person 10.
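
If you prefer software to a printed table, the cumulative normal curve gives the same percentages. A sketch using scipy (assuming it is installed); small differences from the table values are down to rounding:

    from scipy.stats import norm

    # Proportion of the population at or below each income, given the z-scores
    print(round(norm.cdf(-1.86), 3))  # ~0.031 -> about 3% at or below Person 1
    print(round(norm.cdf(1.63), 3))   # ~0.948 -> about 95% at or below Person 10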

Example 2: Let's say we want to compare how students did in two assessments on a course. Was one more difficult than the other? Let us look at the data.

We can also visualise this data. Let's have a look.

The above statistics and visualisations suggest there might be a disparity. We can employ z-scores to standardise the test scores and then compare the level of difficulty.

By using z-scores, we can compare the probability of attaining, for instance, a mark of 50% or less in each assessment.

Assessment 1: 

Individual 1 has attained a score of 35%, which is -1.62 SD (the z-score) beneath the mean. Employing z-scores, we can determine the precise probabilities; in this case, based on the z-score, we would anticipate that 5.2% of students would achieve a mark as low or lower than Individual 1.

Individual 15 has attained a score of 56%, which is 0.06 SD (the z-score) above the mean. Based on the z-score, we would expect 52.6% of students to achieve a mark as low or lower than Individual 15.

Assessment 2: 

Individual 1 has attained a score of 27%, which is -1.18 SD (the z-score) beneath the mean. Based on the z-score, we would anticipate that 11.8% of students would achieve a mark as low or lower than Individual 1.

Individual 15 has attained a score of 45%, which is 0.21 SD (the z-score) above the mean. Based on the z-score, we would expect 58.7% of students to achieve a mark as low or lower than Individual 15.

Employing z-scores, we can contrast the performance of each student between the two assessments and subsequently infer whether one of the exams was more challenging, as we have standardised the test scores. Below, you can view the test and z-scores for all the students.
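
A sketch of how you might standardise two sets of marks yourself; the marks below are hypothetical and not the actual course data:

    import statistics

    def z_scores(scores):
        # Standardise: subtract the mean, then divide by the standard deviation
        mean = statistics.mean(scores)
        sd = statistics.stdev(scores)
        return [(x - mean) / sd for x in scores]

    # Hypothetical marks for the same five students in two assessments
    assessment_1 = [35, 48, 55, 62, 70]
    assessment_2 = [27, 38, 45, 52, 63]

    print([round(z, 2) for z in z_scores(assessment_1)])
    print([round(z, 2) for z in z_scores(assessment_2)])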

Note: Whether the differences between the two variables are statistically significant will be determined by a test covered in Tutorial 4.

Go to this website for a more detailed description of what z-scores are.

4.2.5 Quartiles & Interquartile Range

Quartiles divide your data into four equal parts, typically ordered from the lowest to the highest value. 

Q1 corresponds to the 25th percentile, indicating that 25% of the data falls below this value.

Q2, also known as the median, corresponds to the 50th percentile.

Q3 represents the 75th percentile.

The Interquartile Range (IQR) encompasses the middle 50% of the data and is calculated as the difference between Q3 and Q1.
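
A sketch using Python's statistics module, which returns the three quartile cut points directly (the data are illustrative):

    import statistics

    data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]      # illustrative, already ordered
    q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1                                 # the Interquartile Range

    print(q1, q2, q3)  # 4.0 7.5 11.25
    print(iqr)         # 7.25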

One of the many types of graphs you might encounter is the box plot; it visually represents the quartiles, IQR, and the minimum and maximum values of the dataset.

Note: Although each quartile covers an equal 25% of the dataset, the actual spread of the data within each quartile can vary since quartiles reflect the distribution of data values rather than fixed percentages.

You can also divide your data into more segments than just four equal parts if required. Most statistical software provides easy-to-use features for this purpose.

For more information on Quartiles follow this link.


4.5 The Standard Error

In statistics, we hardly ever have a full dataset and usually work with sample data. Put simply, the Standard Error gives you an idea of how far your sample statistics might be from the true population mean.

Essentially, the Standard Error is the standard deviation of the distribution of sample means. In the image below you see that we have four groups, each with a different income mean. The Standard Error is the standard deviation of those sample means.

4.5.1 Understanding and Calculating the SE

The standard error (SE) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation. It measures the precision with which a sample statistic, such as the sample mean or sample proportion, estimates the corresponding population parameter. To put it simply, the standard error gives you an idea of how far the sample statistic might be from the true population statistic if you were to repeat the study multiple times.

When you calculate a statistic (like the mean) from a sample of data, that statistic is an estimate of the true mean of the whole population from which the sample was drawn. Be aware that unless your sample covers the entire population, there will always be some error, and the sample mean will vary slightly from the real mean of the population. The standard error tells you how much these sample statistics are expected to vary from one sample to another and, by extension, how much the sample statistic might vary from the actual population value.
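
To see this idea in action, here is a sketch that simulates it: we repeatedly draw samples from a hypothetical population, and the standard deviation of the resulting sample means approximates the standard error (the population values are invented):

    import random
    import statistics

    random.seed(42)

    # Draw 1,000 samples of size 50 from a hypothetical income population
    # (mean 2000, standard deviation 400) and record each sample mean
    sample_means = [
        statistics.mean(random.gauss(2000, 400) for _ in range(50))
        for _ in range(1000)
    ]

    # The SD of the sample means approximates the SE: 400 / sqrt(50) = ~56.6
    print(round(statistics.stdev(sample_means), 1))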

Here's an example to illustrate:

The formula for the standard error depends on the statistic being estimated. For the sample mean, the standard error is calculated as:

SE = s / √n

where s is the sample standard deviation and n is the sample size. The formula is included for those who are curious, but you should conceptually understand what the Standard Error is and what it does.
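
As a sketch, computing the standard error of the mean for a small illustrative sample:

    import statistics

    incomes = [1250, 1800, 2100, 2400, 2760]  # illustrative sample, n = 5

    # SE = sample standard deviation divided by the square root of n
    se = statistics.stdev(incomes) / len(incomes) ** 0.5
    print(round(se, 2))  # ~257.96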

The smaller the standard error, the more precise the estimate of the population parameter. It's important to note that the standard error decreases as the sample size increases because a larger sample gives more information about the population and reduces the variability between samples.

In summary, the standard error is a fundamental concept in inferential statistics that quantifies the uncertainty or variability of a sample statistic as an estimator of a population parameter. It plays a key role in constructing confidence intervals and conducting hypothesis tests.


For a much more detailed explanation of what the Standard Error is, watch this easy-to-follow video. 

4.6 Linear Transformation

Linear transformations in statistics are operations applied to data that change the original variables into new variables. The purpose of these transformations is to simplify data analysis, help meet the assumptions of statistical tests, improve interpretability, and facilitate better graphical representation. Linear transformations can help in normalising data, correcting for skewness, improving the linearity of relationships, and making the scales of different variables comparable.

4.6.1 Linear Transformation

Addition/Subtraction

Addition or subtraction of a constant to a dataset is a linear transformation that shifts the location of the data. This operation changes the mean of the dataset but does not affect the shape of the distribution, variance, or the correlation between variables. For example, when converting temperatures from Celsius to Fahrenheit, the addition is part of the transformation. Or you might want to change a variable that has negative data points to one that has all positive data points. 

Subtraction is particularly useful for centring data by subtracting the mean, which is a common step in various data preprocessing techniques and can improve the numerical stability of certain algorithms.

Multiplication

Multiplication of data by a constant is another linear transformation that changes the scale of the data. It affects measures of spread (such as the standard deviation and variance) but does not alter the shape of the distribution or the correlation between variables. Multiplication is used in normalising data, where values are adjusted to a common scale without distorting differences in the ranges of values.

Multiplication does stretch or compress the distribution of a dataset. When you multiply all data points in a dataset by a constant, you change the scale of the distribution. This means that measures of spread are all affected: the range and standard deviation are multiplied by the absolute value of the constant, and the variance by its square.

However, the shape of the distribution, such as the skewness or kurtosis, and the correlation between variables, do not change. So, while the data points move further apart (or closer together), the relative distances between them remain the same, preserving the overall structure of the dataset. This is why multiplication is considered a linear transformation, as it maintains the linearity of relationships between variables.
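
A sketch demonstrating both effects at once (the data are made up):

    import statistics

    data = [2, 4, 6, 8]
    shifted = [x + 10 for x in data]  # addition: shifts the location
    scaled = [x * 3 for x in data]    # multiplication: rescales the spread

    print(statistics.mean(data), round(statistics.stdev(data), 2))        # 5  2.58
    print(statistics.mean(shifted), round(statistics.stdev(shifted), 2))  # 15 2.58 - SD unchanged
    print(statistics.mean(scaled), round(statistics.stdev(scaled), 2))    # 15 7.75 - SD tripled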

4.7 Suggested Reading Materials

For those interested in learning the underlying maths, read: