# Tutorial 5: Present Data

### Tutorial Overview:

There are many ways to present your data, most commonly tables and graphs. In this tutorial, we will guide you through some of the primary methods of presenting your data. These basic lessons will also apply to more advanced tutorials later on.

Simple tables are great tools for presenting some types of data. Visualisations, on the other hand, can make it easier to understand more complex relationships within your data. Entire books have been written about data visualisation – one of the best is listed in the reference list below.

Below, we will guide you through some of the most common methods for presenting and visualising basic descriptive data and basic correlations. Many programs allow you to visualise your data, including Excel, Jamovi and SPSS.

To visualise data, we need to reduce it to a 2D (or sometimes 3D) graph that helps us understand the relationships within the data we are looking at.

Online apps such as Datawrapper and RawData are fantastic free visualisation tools.

## 5.1 Counts, Percentages and Cumulative Percentages

Counts and percentages are frequently used to describe your data, particularly your categorical data. You will use them in many tables and even graphs. Understanding how to interpret them and knowing when to use each one is really important.

Generally, counts and percentages are used for categorical data and these measures are usually not suitable for representing continuous data for the following reasons:

Loss of Information: Using counts for continuous data involves categorizing or discretizing the data, which can lead to a significant loss of information. The nuances and variability inherent in the original data are often lost when it's reduced to counts.

Inappropriate Summary: Counts simply tell you how many times a particular value appears. For continuous data, this is rarely useful because the likelihood of the exact same value repeating is low due to the data's nature (it can take on a wide range of values).

Misleading Interpretations: If continuous data is forced into categories to count them, it may lead to misleading interpretations. The arbitrary nature of how the data is binned or categorized can significantly influence the results and insights.

Alternative Statistical Measures: Continuous data is better represented by measures that acknowledge its nature, such as the mean, median, mode, range, standard deviation, etc. These measures provide a more accurate and insightful understanding of the distribution, central tendency, and spread of the data.

However, there are instances where continuous data might be placed into categories for specific analytical purposes or to make the data more understandable (in the previous session you learned how to transform data).

For example, age, which is technically continuous, might be categorized into groups (e.g., 0-20, 21-40, etc.) for analysis or reporting. But even then, the categorization is done with an understanding that some information is being lost for the sake of simplicity or clarity, and it's not the same as using counts in the way you might for discrete data (like the number of individuals in each age group).
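The age example above can be sketched in a few lines of Python. This is a minimal illustration only: the ages are hypothetical, the `age_group` helper is not part of any package, and the bands after 21-40 are an assumption that simply continues the pattern given in the text.

```python
from collections import Counter

# Hypothetical ages, for illustration only (not taken from the Extremism dataset)
ages = [18, 25, 33, 41, 47, 52, 60, 19, 38, 71]

def age_group(age):
    """Return the label of the age band that a single age falls into."""
    if age <= 20:
        return "0-20"
    if age <= 40:
        return "21-40"
    if age <= 60:
        return "41-60"
    return "61+"

# Count how many people fall into each band
group_counts = Counter(age_group(a) for a in ages)

for label in ["0-20", "21-40", "41-60", "61+"]:
    print(label, group_counts[label])
```

Note that once the ages are grouped, the individual values (18 versus 19, for instance) can no longer be recovered from the counts, which is the loss of information described above.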

In summary, while there are ways to categorize continuous data, counts are not generally used to represent such data due to the loss of detail and the potential for misleading interpretations. More appropriate statistical methods are recommended to capture the richness and nuances of continuous data.

### 5.1.1 Counts

The most basic way to present your data is using counts. For example, you want to know how many people in your sample have a degree or not. The easiest way is to simply add up all of the instances where someone has a degree and how many do not. You can then present this information in a simple table. That's exactly what counts are – the total number of times something happens or appears.

Let's have a look at the Has Degree (Yes/No) variable in the Extremism dataset.

In the context of this example, counting is an effective way to understand the educational composition of your dataset. Let's explore the table below. For now, we are not focusing on the '% of total' and 'Cumulative %' columns. However, it's good to know that these columns can provide additional context by showing how each category relates to the whole dataset ('% of total') and how the data accumulates when you add the categories together ('Cumulative %'). More on these two later.

### How to read this table:

Our table focuses on one variable: whether individuals in the Extremism dataset have a degree or not. It is broken down into two categories: 'Yes' for those who have a degree and 'No' for those who do not.

At a glance, this table tells us the distribution of individuals with and without a degree in the dataset. Specifically:

1085 individuals do not have a degree.

1367 individuals have a degree.

Here, the count is a simple yet powerful way to present this aspect of the data. By just looking at the numbers, we get a sense of the makeup of our dataset regarding educational attainment.
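If you would like to see how such a count is produced, here is a minimal Python sketch. It rebuilds the Has Degree column from the two counts reported above; in practice you would load the column from your data file, and the variable names here are illustrative assumptions.

```python
from collections import Counter

# Rebuild the Has Degree column from the two counts reported in the text.
# In a real analysis this list would come from your dataset instead.
has_degree = ["No"] * 1085 + ["Yes"] * 1367

# Counter tallies how many times each category appears
counts = Counter(has_degree)

for category, n in sorted(counts.items()):
    print(f"{category}: {n}")

# The category with the highest count
most_common_category, _ = counts.most_common(1)[0]
print("Most common:", most_common_category)
```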

### Interpreting the Counts:

Using counts you can immediately get a sense of which category (having a degree or not) is more common in the dataset. This is the most basic form of data analysis but can be quite insightful, especially when you're starting to understand a new dataset.

Let's have a look at another example from the same dataset but with more levels. Here we are using the variable 'Shares nothing with Society'. The below table shows some of the limitations of counts. As there are more levels it becomes harder to instantly see the proportions of the different levels.

### When to Use Counts:

Use counts when you want to know the total number of something or to quantify a variable, such as how many people have a degree. When you are exploring your raw data and your primary interest is in the actual numbers rather than proportions or comparisons, counts provide a straightforward representation and help you get a sense of the size and distribution of your data.

### When Counts Might Not Be Ideal:

If you are comparing groups of different sizes, counts can become misleading. For example, if you are comparing how many people have a degree in two cities or regions, the counts may differ simply because the population sizes differ. Counts will not take this into account. In such cases, percentages will provide you with a much better comparison.
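The two-city comparison can be illustrated with a short Python sketch. All figures below are hypothetical and chosen only to show how the raw count and the percentage can point in opposite directions.

```python
# Hypothetical figures for two cities of very different sizes
cities = {
    "City A": {"with_degree": 30_000, "population": 100_000},
    "City B": {"with_degree": 9_000, "population": 20_000},
}

# Convert each count into a percentage of that city's own population
percent_with_degree = {
    name: 100 * figures["with_degree"] / figures["population"]
    for name, figures in cities.items()
}

for name, figures in cities.items():
    print(f"{name}: {figures['with_degree']} people, "
          f"{percent_with_degree[name]:.1f}% of the population")
```

In this made-up example City A has far more degree holders in absolute terms, yet City B has the higher share of its population with a degree.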

Counts tell you about quantities but not about relationships or trends. If you're interested in understanding patterns or changes over time, you might need more sophisticated statistical techniques or visualisations.

Remember, the choice between using counts, percentages, or other statistical measures depends largely on the context of your data and the specific questions you're trying to answer. It's always important to consider the nature of your dataset and the goals of your analysis when deciding how to present your data.

### 5.1.2 Percentages & Cumulative Percentages

### Percentages

Percentages are a way to express a number as a fraction of 100. They help us understand the size of one category in relation to the whole. In the previous example, we used the 'Shares nothing with society' variable (see table below). You might be particularly interested in how many participants Strongly Agree with this statement, which happens to be 136 out of 2,530. It is, however, much simpler to say that 5.4% of participants strongly agree with this statement. Percentages provide a clear and concise way to express proportions, making it much easier to understand and compare categories.

Just for context, calculating percentages is easy. In the above case, we divide 136 by 2,530, which gives us a decimal of approximately 0.054. All we have to do now is multiply this by 100 to get the percentage (5.4%). But not to worry: statistics programs will calculate the percentages for you. It is worth noting that stats programs sometimes report a decimal rather than a percentage for values such as the p-value. If you want to express such a value as a percentage, all you have to do is multiply it by 100.
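The same arithmetic, written out as a small Python sketch using the two numbers reported above (the variable names are illustrative):

```python
strongly_agree = 136   # number of participants who Strongly Agree (from the text)
total = 2530           # total number of responses (from the text)

proportion = strongly_agree / total   # a decimal between 0 and 1
percentage = proportion * 100         # the same value expressed out of 100

print(f"{proportion:.3f} -> {percentage:.1f}%")   # prints "0.054 -> 5.4%"
```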

### Cumulative Percentages

Cumulative percentages take the concept of percentages a step further. They don't just tell you about one category; they tell you about the accumulation of multiple categories up to a certain point. Using the same example as above, 5.4% of participants Strongly Agree and 15.9% Somewhat Agree. The cumulative percentage up to Somewhat Agree would be 5.4% + 15.9% = 21.3%. This means that 21.3% of participants somewhat agree or strongly agree that they share nothing with society.

In datasets, cumulative percentages help us understand the data's distribution by showing the accumulation of categories. For instance, if you're looking at income levels categorized into groups, the cumulative percentage can tell you what portion of your sample falls below certain income thresholds. This is especially useful for understanding the spread and distribution of your data across ordered categories.
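As a rough sketch of the income example, the Python snippet below computes percentages and cumulative percentages for ordered income bands. The band labels and counts are hypothetical and serve only to show the calculation.

```python
from itertools import accumulate

# Hypothetical counts per income band, listed from lowest to highest band
income_bands = ["Under 20k", "20k-40k", "40k-60k", "Over 60k"]
band_counts = [40, 80, 50, 30]

total = sum(band_counts)

# Percentage of the sample in each band
percentages = [100 * n / total for n in band_counts]

# Running total of the counts, converted to a percentage of the sample
cumulative_percentages = [100 * running / total for running in accumulate(band_counts)]

for band, pct, cum in zip(income_bands, percentages, cumulative_percentages):
    print(f"{band:<10} {pct:5.1f}%  cumulative {cum:5.1f}%")
```

Reading the cumulative column for the second band tells you what share of this made-up sample earns under 40k. The last cumulative value is always 100%, and cumulative percentages only make sense when the categories have a natural order.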

In summary, percentages offer a straightforward way to comprehend and compare the significance of each category in your dataset. Cumulative percentages build on this by showing how categories accumulate, providing deeper insights into the data's structure and distribution. Both are essential tools for effectively describing and understanding categorical variables in your data.

### 5.1.3 Row and Column Percentages

A contingency table, also known as a cross-tabulation or crosstab, is a type of table in statistics that displays the frequency distribution of multiple variables. It is often used to show the relationship between two categorical variables, with one variable corresponding to the rows and the other to the columns. Each cell in the table shows the count or percentage of observations that correspond to the row and column categories. Contingency tables are useful for examining the association between the variables and are commonly used in chi-square tests to determine statistical significance (more on chi-square in a later tutorial).

The table below shows you a contingency table using counts. You should realise quite quickly that counts make the table harder to interpret.

To make the table more understandable we use percentages.

Row and column percentages help you understand the distribution of categorical data in a contingency table. They offer insight into the relationship between two categorical variables by expressing the frequencies as percentages.

### Row Percentages

Row percentages show the distribution of frequencies across columns within each row. They answer the question: "Given a specific category of the row variable, how are the cases distributed among the categories of the column variable?" The percentages within each row will add up to 100%.

In the example below, we can say that, out of all the participants who Strongly Agree with the statement that they share nothing with society, 67.2% are male.

Note: What you cannot say, however, is that 67.2% of all males Strongly Agree with the statement as you are only looking at the instances in the row Strongly Agree.

### Column Percentages

Column percentages show the distribution of frequencies across rows within each column. They answer the question: "Given a specific category of the column variable, how are the cases distributed among the categories of the row variable?" The percentages within each column will add up to 100%.

Looking at the below table, we can therefore argue that 5.1% of all females Strongly agree with the statement that they Share nothing with Society.

Note: What you cannot say is that only 5.1% of all the participants who Strongly Agree are female, as you are only looking at the instances in the Female column.
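To make the difference between the two kinds of percentage concrete, here is a minimal Python sketch. The 2x2 table is hypothetical (it is not the Extremism data) and is stored as a plain dictionary, with responses as rows and gender as columns.

```python
# Hypothetical counts: rows are responses, columns are gender
table = {
    "Agree":    {"Female": 20, "Male": 60},
    "Disagree": {"Female": 80, "Male": 40},
}
columns = ["Female", "Male"]

# Row percentages: each cell divided by the total of its own row
row_pct = {
    row: {col: 100 * n / sum(cells.values()) for col, n in cells.items()}
    for row, cells in table.items()
}

# Column percentages: each cell divided by the total of its own column
col_totals = {col: sum(table[row][col] for row in table) for col in columns}
col_pct = {
    row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
    for row, cells in table.items()
}

print("Of those who Agree, % male:", row_pct["Agree"]["Male"])
print("Of all males, % who Agree:", col_pct["Agree"]["Male"])
```

The same cell (males who agree) gives two different percentages depending on whether you divide by the row total or the column total, which is why it matters to check which one your software has reported.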

### Usage in Statistics

Understanding the difference between row and column percentages is very important, as it will help you interpret your data correctly.

Row and column percentages are crucial for understanding the relationship between two categorical variables, especially when the total counts in rows or columns differ significantly. They allow researchers to compare categories fairly by standardizing the counts as percentages, making it easier to see patterns or differences.

In social science, these percentages can reveal preferences, behaviours, or trends among different groups, helping to draw meaningful conclusions from the data. They are essential tools in fields like sociology, psychology, and political science, where understanding the distribution and relationship between various social factors is key.

In the final example below, we have added a layer. All a layer does is break down the data further. In this case, for example, the column percentage is dependent on the layer. Here we can say that, of all the males, 6% agree with the statement that they have nothing in common with society.

Below are two videos that will help you interpret contingency tables. The one on the right also explains how to interpret contingency tables with multiple layers.

## 5.2 Continuous Data - Single Variables

### 5.2.1 Histogram and Density Plots

Histograms and Density plots are an easy way to visualise continuous variables. Let's have a look at both:

Histograms: These are perhaps the most popular graphs for continuous data. They resemble bar charts but differ because they represent the frequency of occurrence of data in consecutive, non-overlapping intervals or bins. The area of each bar is proportional to the frequency of the items found within each interval, providing a visual representation of the distribution of the data.

Density Plots: These are smoothed versions of histograms and are used to show the distribution of a continuous variable. They can be particularly helpful when the data contains multiple peaks or modes.
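The binning that sits behind a histogram can be sketched in plain Python. The scores and the bin width below are hypothetical, and the text bars are only a stand-in for a real plot produced by your statistics software.

```python
from collections import Counter

# Hypothetical continuous scores, for illustration only
scores = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.3, 7.7, 8.2, 9.6]
bin_width = 2

# Bin k covers the interval from k * bin_width up to (but not including)
# (k + 1) * bin_width, so the bins are consecutive and non-overlapping.
bin_counts = Counter(int(score // bin_width) for score in scores)

for k in sorted(bin_counts):
    lower, upper = k * bin_width, (k + 1) * bin_width
    print(f"{lower:>2}-{upper:<2} | {'#' * bin_counts[k]}")
```

Changing `bin_width` changes the shape of the histogram, which is one reason a smoothed density plot is sometimes shown alongside it.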

Using the Density and histogram plots, we can visually explore our data. Below are two examples.

### 5.2.2 Descriptive Statistics Table

Displaying descriptive statistics in a table provides a clear and concise summary of the data, allowing for quick reference and comparison. Tables help in organizing key measures such as mean, median, mode, range, variance, and standard deviation, making it easier to understand the distribution and central tendencies of the dataset. They are particularly useful for presenting the results of a study to those who may not have a deep statistical background, helping you communicate complex data in an easy-to-understand format.
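The measures listed above can be computed with Python's built-in `statistics` module. The sketch below uses a hypothetical list of scores and prints a simple two-column summary; it uses the sample variance and sample standard deviation, which is an assumption about what your software reports by default.

```python
import statistics

# Hypothetical scores, for illustration only
scores = [12, 15, 15, 18, 20, 22, 24]

summary = {
    "N": len(scores),
    "Mean": statistics.mean(scores),
    "Median": statistics.median(scores),
    "Mode": statistics.mode(scores),
    "Range": max(scores) - min(scores),
    "Variance": statistics.variance(scores),   # sample variance (divides by n - 1)
    "Std. deviation": statistics.stdev(scores),  # sample standard deviation
}

# Print the summary as a simple two-column table
for name, value in summary.items():
    print(f"{name:<15} {value:>8.2f}")
```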

The table below provides a quick and easy overview of three variables: Self Confidence Scores, Racism Scale and Political Engagement. Any statistical software programme should allow you to generate such tables easily and to include or exclude individual statistics as needed.

In the example below we have combined the power of visualisation and tables.

### 5.2.3 Violin & Boxplots

Violin and Box Plots are another way of visualising continuous variables. These plots provide more information about the distribution, in particular Quartiles and the Inter Quartile Range. They also highlight outliers in the data.

Box Plots: Often also referred to as Box-and-Whisker Plots, these are a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. They provide a visual snapshot of the central tendency and dispersion of the data, as well as outliers. You interpret a boxplot as follows:

Central Rectangle (The Box): The central box of the plot shows the middle 50% of the data. The bottom and top edges of the box represent the first quartile (Q1) and third quartile (Q3), respectively. This distance is known as the interquartile range (IQR) and contains the middle 50% of the values.

Median Line: Inside the box, there's usually a line that represents the median (the middle value) of the dataset. If this line is not equidistant from the quartiles, it indicates skewness in the data.

Whiskers: The lines extending from the top and bottom of the box, known as whiskers, indicate the range of the data. Their length represents the spread of the data outside the middle 50%, extending to the minimum and maximum values within a certain distance from the quartiles (commonly 1.5 times the IQR). Points beyond this are often considered outliers.

Outliers: Individual points plotted beyond the whiskers are typically considered outliers. These are data points that differ significantly from the rest of the data.

When interpreting a box plot, look for:

The symmetry of the box and whiskers, which suggests the symmetry of the data distribution.

The length of the whiskers in comparison to the box, which indicates variability.

The position of the median, which shows skewness. If the median is closer to the bottom of the box, the distribution is positively skewed (right-skewed), and if it's closer to the top, it's negatively skewed (left-skewed).

Any outliers that might indicate anomalies in the data or areas for further investigation.
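The numbers behind a box plot can be worked out with a short Python sketch. The values are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; statistics packages use slightly different quartile definitions, so their results may differ a little from these.

```python
import statistics

# Hypothetical values with one unusually large observation
values = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]

# Quartiles: Q1, the median (Q2) and Q3
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# The common 1.5 x IQR rule for where the whiskers may reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers end at the most extreme values that are still inside the fences
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```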

Violin Plots: These are a method of plotting continuous data and can be used to compare the distribution of data across different categories. Violin plots combine elements of box plots and density plots. The violin shape shows the distribution of the data (its density) at different values, with thicker sections representing higher density (more data points). They are useful for visualizing the distribution and probability density of the data, as well as for comparing multiple groups. The interpretation involves looking at the width of the 'violins' to gauge the data's density at points along the value axis, with the median and interquartile ranges often marked for additional context.

The example below shows you a boxplot of the Extremism score. Notice that this particular variable is zero-inflated, meaning that it contains a large excess of zeros. Here, the majority of all values are 0, so the bottom quartiles are all 0. Placing the box plot alongside a violin plot overlaid with the data points gives you a good overview of how the data is distributed.

The two plots below show you the distribution of two more variables using all the plots combined.

## 5.3 Categorical Data - Single Variables

### 5.3.1 Bar Plots

Bar plots, or bar charts, are a type of graph that represents categorical data (both nominal and ordinal) with rectangular bars. Each bar's height (or length, in the case of horizontal bars) is proportional to the value it represents, which is typically a count or frequency of occurrences for each category.

Bar plots are widely used for:

Comparing Categories: They provide a visual comparison of different categories or groups. The height of the bars allows for quick assessment of which categories are largest or smallest and by how much.

Showing Frequency or Proportions: Bar plots can display how often certain categories occur, or the proportion they represent of the total data.

Identifying Trends: In some cases, especially with ordered categorical data, bar plots can help identify trends or patterns over categories.

Interpreting a bar plot involves analyzing the height of the bars to understand the data's distribution across different categories. Taller bars indicate higher values. One also looks for patterns, such as increasing or decreasing trends across ordered categories. It's also important to consider the scale of the axes and whether the bars represent raw counts, percentages, or other measures. Where possible, use percentages rather than raw counts.

Bar plots can also include error bars that extend from the top of each bar to represent the variability or uncertainty in the data, providing a visual cue about the reliability of the data.

You can display the data either horizontally or vertically. You can also choose to stack your data.

Below is an example of two horizontal bar graphs displaying the same data, the first using percentages and the second using raw counts.

The next two horizontal bar graphs represent the same data, again using percentages and raw counts, but stacking the data.

Generating these types of graphs is relatively simple in most statistical software programs.
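To show the idea behind a horizontal bar chart without any plotting software, here is a rough text-only Python sketch. It uses the Has Degree counts reported in section 5.1.1; the scaling of one `#` per two percentage points is an arbitrary choice for display.

```python
# Has Degree counts as reported in section 5.1.1
degree_counts = {"No": 1085, "Yes": 1367}
total = sum(degree_counts.values())

# Percentage of the sample in each category
degree_pct = {category: 100 * n / total for category, n in degree_counts.items()}

for category, n in degree_counts.items():
    # Bar length is proportional to the percentage (one '#' per 2 points)
    bar = "#" * round(degree_pct[category] / 2)
    print(f"{category:<3} {bar} {degree_pct[category]:.1f}% (n={n})")
```

Labelling each bar with both the percentage and the count lets the reader see the proportion at a glance while still knowing how many people it represents.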

## 5.4 Multiple Variables

### Scatter Plots

Scatterplots are graphs used to visualize the relationship between two continuous variables. Each point on a scatterplot represents an observation in the dataset, with its position determined by the values of these two variables: one variable is plotted along the x-axis and the other along the y-axis.

The y-axis usually represents your dependent variable.

Scatterplots are used to:

Identify Relationships: They can reveal the nature of the relationship between the variables, whether it's linear, nonlinear, or non-existent.

Detect Correlations: They help in assessing whether an increase in one variable correlates with an increase (positive correlation) or decrease (negative correlation) in the other variable.

Spot Outliers: Scatterplots make it easier to identify observations that stand out from the overall pattern, which could be potential outliers.

Visualize Distributions: They display how data points are clustered together or spread out, indicating the distribution and concentration of data.

Interpreting a scatterplot involves looking for patterns formed by the data points:

A pattern that runs from the lower left to the upper right suggests a positive correlation.

A pattern that runs from the upper left to the lower right indicates a negative correlation.

If the points are widely scattered with no apparent pattern, they may indicate no correlation.

The closer the data points come to forming a straight line when plotted, the stronger the correlation between the variables.

Additionally, one can fit a line, often called a line of best fit or regression line, to the data to quantify the relationship. The slope of this line, along with the spread of points around it, can give insights into the strength and nature of the relationship.
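The correlation and the line of best fit described above can be sketched with plain Python. The ages and scores are hypothetical (they are not the Extremism data), the correlation is Pearson's r, and the line is an ordinary least-squares fit, which is an assumption about what "line of best fit" means in your software.

```python
import math

# Hypothetical paired observations: x = age, y = score
age = [20, 30, 40, 50, 60]
score = [9, 7, 6, 4, 2]

mean_x = sum(age) / len(age)
mean_y = sum(score) / len(score)

# Sums of squares and cross-products around the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, score))
sxx = sum((x - mean_x) ** 2 for x in age)
syy = sum((y - mean_y) ** 2 for y in score)

# Pearson correlation: the sign gives the direction, the size gives the strength
r = sxy / math.sqrt(sxx * syy)

# Least-squares line of best fit: score = intercept + slope * age
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"r = {r:.3f}")
print(f"score = {intercept:.2f} + ({slope:.2f} * age)")
```

A negative `r` and a negative `slope` both correspond to a pattern running from the upper left to the lower right of the scatterplot.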

The figure below visualises a simple relationship between Extremism and Age. Notice that there is a negative relationship between the two variables.

We can of course also add grouping variables to the graph. These will give us an indication of how the relationship between the two continuous variables changes in different groups.

In the graph below we have added Military Service (a circle for no military service and an x for those with military service). The circle also increases in size as the Extremism score increases.

In the final graph we have replaced Military Service with Has Degree. Again, the size of the circle increases as the individual's Extremism score increases.

### Contingency Tables & Bar Plots

Contingency tables, also known as cross-tabulation tables, are a type of table in statistics that is used to summarize the relationship between several categorical variables. Typically, they display the frequency distribution of variables in a matrix format, where each cell represents the count or proportion of observations for a specific combination of categories from the variables involved.

### Using Contingency Tables:

We use contingency tables for several reasons:

To Understand Relationships: Contingency tables help in identifying and analyzing the relationships between two or more categorical variables. By organizing data into these tables, researchers can observe patterns, trends, and potential associations within the data.

To Summarize Data: They provide a concise way to summarize large datasets, making it easier to visualize and understand the distribution of variables and their interrelations.

To Facilitate Comparisons: By breaking down data into categories, contingency tables allow for straightforward comparisons across different groups, which can be crucial in fields like market research, social sciences, and health studies.

To Prepare for Further Statistical Analysis: While we will not cover Chi Square here, it is worth noting that contingency tables set the stage for various statistical analyses by organizing data into a format that can be easily used for tests of independence or association.

Above we introduced you to how to read row and column percentages. By now you should be pretty comfortable interpreting these types of tables.

These tables allow you to identify patterns. High or low counts in certain cells can indicate possible relationships or lack thereof between the categories of different variables. By comparing the counts or proportions across rows or columns, you can identify which combinations of categories are more or less common.

Contingency tables are a fundamental tool in descriptive statistics and exploratory data analysis, providing a foundation for understanding the structure and relationships within categorical data before applying more complex statistical methods.

In the example below we are exploring how Military Service and attitudes towards society might affect whether someone has extreme views or not.

We can also visualise the above table using bar plots. The figures below use stacked bar charts and percentages.

Instead of using stacked bar charts, we can also use clustered bar charts. Below is an example of using them to visualise the data.

To visualise this type of data, we would generally use percentages rather than counts, as counts can be misleading.

### Venn Diagrams

A Venn diagram is a visual tool used to illustrate the logical relationships and intersections between different sets. It consists of overlapping circles, each representing a set.

We can interpret a Venn diagram as follows:

Individual Sets: Each circle (or other shapes in more complex diagrams) represents a group of items (a set) that share a common property.

Common Area (Intersection): The area where the circles overlap represents the intersection of the sets, which contains items that are common to all the overlapping sets. If two sets overlap, the items in the overlapping region are part of both sets.

Non-Overlapping Area: The areas of a circle that do not overlap with any other circle represent items that are unique to that set and not shared with the other sets.

Outside the Sets (Universal Set): Any area not within a circle represents the items in the universal set that are not contained in any of the sets represented.

By examining which areas overlap and which do not, you can determine the relationships between the sets, such as which items are shared and which are unique to certain sets. Venn diagrams are commonly used in probability, logic, statistics, linguistics, and computer science to visually summarize the relationships between different datasets.

This Venn diagram represents the intersection of three categories: Gender, Importance of Ethnicity, and Extremism_Score_scaled. Here's how to interpret it:

The blue circle represents individuals identified as Male (Gender).

The grey circle represents individuals who consider Ethnicity as Very Important (Importance_Ethnicity).

The brown circle represents individuals labelled as Extreme (Extremism_Score_scaled).

The numbers within each section of the circles represent the count of individuals, along with the percentage in parentheses:

734 males (31.5%) neither consider ethnicity very important nor are labelled as Extreme.

228 individuals (9.8%) consider ethnicity very important but are neither male nor labelled as Extreme.

486 individuals (20.8%) are not male, do not consider ethnicity very important, and are not recorded as Extreme.

406 individuals (17.4%) are male and consider ethnicity very important but are not labelled as Extreme.

293 individuals (12.6%) are males labelled as Extreme who do not consider ethnicity very important.

58 individuals (2.5%) consider ethnicity very important and are labelled as Extreme, but are not male.

103 individuals (4.4%) fall into all three categories: they are male, consider ethnicity very important, and are recorded as Extreme.

Together, the data counts add up to the entirety of the dataset.
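The way a three-circle Venn diagram splits a dataset into regions can be sketched with Python sets. The participant IDs below are hypothetical and are not taken from the Extremism dataset; the point is only that every individual lands in exactly one region, so the region counts sum to the size of the dataset.

```python
# Hypothetical participant IDs, for illustration only
everyone = set(range(1, 11))
male = {1, 2, 3, 4, 5}
ethnicity_very_important = {4, 5, 6, 7}
extreme = {5, 7, 8}

# Each region of the diagram, built with set intersection (&),
# union (|) and difference (-)
regions = {
    "male only": male - ethnicity_very_important - extreme,
    "ethnicity only": ethnicity_very_important - male - extreme,
    "extreme only": extreme - male - ethnicity_very_important,
    "male and ethnicity": (male & ethnicity_very_important) - extreme,
    "male and extreme": (male & extreme) - ethnicity_very_important,
    "ethnicity and extreme": (ethnicity_very_important & extreme) - male,
    "all three": male & ethnicity_very_important & extreme,
    "none of the three": everyone - (male | ethnicity_very_important | extreme),
}

for name, members in regions.items():
    share = 100 * len(members) / len(everyone)
    print(f"{name:<22} {len(members):>2} ({share:.1f}%)")

# The regions do not overlap, so their sizes sum to the whole dataset
print("Total:", sum(len(members) for members in regions.values()))
```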

Venn diagrams are a great way to visually represent the similarities and differences between groups of data.

### Alluvial Diagrams

Alluvial graphs, or alluvial diagrams, are a type of flow diagram used to represent changes, movements, and relationships within and between datasets. They are particularly useful for visualizing categorical data over multiple dimensions and for showing how these categories flow from one state to another over time or through a process.

Using Alluvial diagrams:

Alluvial graphs are ideal for displaying changes in data across different phases or time points, such as changes in job roles within an industry or shifts in population demographics. They can show the relationships between different categories and how individual data points can be grouped or regrouped in different ways. Alluvial graphs are particularly effective in highlighting transitions or flows between states, making it easier to identify trends and patterns.

Interpreting Alluvial diagrams:

The width of the streams or flows in an alluvial graph is proportional to the quantity they represent, allowing for a quick visual assessment of their relative sizes.

Nodes: The blocks or nodes represent distinct states or categories, and their positioning along the horizontal axis usually corresponds to different time points or conditions.

Flows: The paths connecting the nodes show the movement or transitions of data points between these nodes. By following the paths, one can track how individual elements move between categories.

Interpreting an alluvial graph involves tracing the flows from their origin to their destination, noting the volume of flow and any significant patterns, such as convergence (different groups coming together), divergence (a single group splitting into multiple paths), or consistency (flows remaining stable across stages). This can provide insights into the underlying structure of the data and reveal complex relationships.

The above graph shows the relationship between the Extremism variable and Share Nothing with Society.
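The quantities that an alluvial diagram draws, namely the size of each node and the width of each flow, are simply counts of category combinations. The Python sketch below shows this with a handful of hypothetical records; the category labels are illustrative and the numbers are not from the Extremism dataset.

```python
from collections import Counter

# Hypothetical records: (category on the left axis, category on the right axis)
records = [
    ("Not extreme", "Disagree"),
    ("Not extreme", "Disagree"),
    ("Not extreme", "Agree"),
    ("Extreme", "Agree"),
    ("Extreme", "Agree"),
    ("Extreme", "Disagree"),
    ("Not extreme", "Disagree"),
]

# Flow width: how many records share each left/right combination
flows = Counter(records)

# Node size: how many records fall into each category on each axis
left_nodes = Counter(left for left, _ in records)
right_nodes = Counter(right for _, right in records)

for (left, right), n in flows.most_common():
    print(f"{left} -> {right}: {n}")
```

The flows leaving a node always add up to the size of that node, which is why the streams in the diagram fill each block exactly.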

## 5.5 Suggested Reading Materials

Tufte, Edward R. (2001) The Visual Display of Quantitative Information. 2nd ed. Graphics Press.