Correlation: Pearson vs. Spearman Rank Correlation

Karthikeyan Dhanakotti
5 min read · Aug 11, 2024

Introduction

Correlation is a statistical measure that expresses the extent to which two variables are related. It is an essential concept in data analysis: it helps us understand the relationship between variables and make informed decisions based on that understanding. However, correlation alone doesn’t imply causality, and it’s crucial to choose the correct correlation method for the nature of the data.

What is Correlation?

Correlation quantifies the degree to which two variables move in relation to each other. A correlation coefficient ranges from -1 to 1, where:

  • 1 indicates a perfect positive linear relationship.
  • -1 indicates a perfect negative linear relationship.
  • 0 indicates no linear relationship.
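The three reference values above can be reproduced with a few made-up numbers. This is only a small sketch: the arrays are hypothetical, and NumPy's `np.corrcoef` is used here simply as one convenient way to get the coefficient.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the three reference cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises by a fixed amount per unit of x: perfect positive linear relationship.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls by a fixed amount per unit of x: perfect negative linear relationship.
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# y is a symmetric U-shape around x = 3: y clearly depends on x,
# but there is no linear trend for the coefficient to pick up.
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(round(r_pos, 6), round(r_neg, 6), round(abs(r_zero), 6))
```

The last case is worth keeping in mind: a coefficient of 0 rules out a linear relationship, not a relationship of any kind.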

Pearson Correlation Coefficient

The Pearson Correlation Coefficient (also known as Pearson’s r) measures the strength and direction of the linear relationship between two continuous variables. It captures linear relationships only, and the standard significance test for r assumes that the data are approximately normally distributed.

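Formula:

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]

where

  • X_i and Y_i are the individual data points.
  • \bar{X} and \bar{Y} are the means of the data sets.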
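To make the definition concrete, Pearson's r can be computed directly as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below does this with NumPy and compares the result with `scipy.stats.pearsonr`; the helper name `pearson_r` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_r(x, y):
    """Pearson's r from deviations about the means (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Hypothetical sample, not taken from the article.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 58, 71, 75, 80]

manual = pearson_r(hours, scores)
library, _ = pearsonr(hours, scores)
print(f"manual: {manual:.4f}, scipy: {library:.4f}")
```

Both values should agree to within floating-point rounding, which is a quick way to check that the formula has been read correctly.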

Example of Pearson Correlation

Imagine we want to examine the relationship between students’ hours of study and their exam scores. We believe there is a linear relationship between the two variables, so Pearson Correlation is suitable.

Data:

  Hours_of_Study   Exam_Score
  2                65
  3                70
  4                75
  5                80
  6                85

Python Implementation:

import pandas as pd
from scipy.stats import pearsonr

# Data
data_pearson = {
    'Hours_of_Study': [2, 3, 4, 5, 6],
    'Exam_Score': [65, 70, 75, 80, 85]
}

# Convert to DataFrame
df_pearson = pd.DataFrame(data_pearson)

# Calculate Pearson Correlation
pearson_corr, _ = pearsonr(df_pearson['Hours_of_Study'], df_pearson['Exam_Score'])
print(f"Pearson Correlation: {pearson_corr}")

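Output:

Pearson Correlation: 1.0 (the last digits may differ slightly because of floating-point rounding)

Interpretation: The Pearson Correlation coefficient is 1, indicating a perfect positive linear relationship between the hours of study and exam scores. In this toy data set, each additional hour of study corresponds to exactly 5 more points, so all observations fall on a straight line.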

Spearman Rank Correlation Coefficient

The Spearman Rank Correlation Coefficient measures the strength and direction of the monotonic relationship between two variables, computed on their ranks rather than their raw values. Unlike Pearson, Spearman does not assume that the data are normally distributed, and it can capture both linear and non-linear relationships as long as they are monotonic.

Formula:

\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} \]

Where:

  • d_i is the difference between the ranks of corresponding values in the two data sets.
  • n is the number of observations.
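As a rough check on the rank-difference definition, the sketch below computes the coefficient three ways: from the squared rank differences, as Pearson's r applied to the ranks, and with `scipy.stats.spearmanr`. The data are hypothetical and contain no tied values; the rank-difference shortcut is only exact in that case, so with ties the first result can differ from the other two.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical sample with no tied values.
x = [10, 20, 30, 40, 50, 60]
y = [1, 3, 2, 6, 4, 5]

rank_x = rankdata(x)  # ranks 1..n; ties would get average ranks
rank_y = rankdata(y)

d = rank_x - rank_y   # rank differences d_i
n = len(x)

# 1) Shortcut formula based on squared rank differences.
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# 2) Pearson's r applied to the ranks.
rho_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

# 3) SciPy's implementation.
rho_scipy, _ = spearmanr(x, y)

print(f"{rho_formula:.4f} {rho_ranks:.4f} {rho_scipy:.4f}")
```

Seeing Spearman as "Pearson on ranks" also explains why it only cares about the ordering of the values, not their spacing.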

When to Use Spearman Rank Correlation:

  • When the data are not normally distributed or contain outliers.
  • When the relationship between variables is non-linear but monotonic.
  • Examples: Ranking students by exam scores or correlating the order of products sold with customer satisfaction.
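The "non-linear but monotonic" case is easy to see with an exponential curve. In the sketch below (hypothetical data), y always increases with x, so the ranks agree perfectly, yet the points are far from a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows exponentially with x (monotonic, not linear).
x = np.arange(1, 11)
y = np.exp(x)

pearson_val, _ = pearsonr(x, y)
spearman_val, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_val:.3f}")
print(f"Spearman: {spearman_val:.3f}")
```

Spearman reports a perfect monotonic relationship, while Pearson comes out noticeably below 1 because it penalises the curvature.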

Example of Spearman Rank Correlation

Suppose we want to analyze the relationship between the ranking of students based on their hours of study and their performance in a competitive game where scores do not increase linearly with study time. In this case, Spearman Rank Correlation is more appropriate.

Data:

  Hours_of_Study   Game_Score
  2                20
  3                45
  4                40
  5                70
  6                65

Python Implementation:

import pandas as pd
from scipy.stats import spearmanr

# Data
data_spearman = {
    'Hours_of_Study': [2, 3, 4, 5, 6],
    'Game_Score': [20, 45, 40, 70, 65]
}

# Convert to DataFrame
df_spearman = pd.DataFrame(data_spearman)

# Calculate Spearman Rank Correlation
spearman_corr, _ = spearmanr(df_spearman['Hours_of_Study'], df_spearman['Game_Score'])
print(f"Spearman Rank Correlation: {spearman_corr}")

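Output:

Spearman Rank Correlation: 0.8 (the last digits may differ slightly because of floating-point rounding)

Interpretation: The Spearman Rank Correlation coefficient is 0.8, indicating a strong positive monotonic relationship between hours of study and game scores. The two rankings differ only by two swaps of adjacent positions, so the sum of squared rank differences is 4 and the coefficient is 1 − 6·4 / (5·24) = 0.8. Although the relationship is not perfectly monotonic (as shown by the dips in game scores), the overall trend is that students who study more tend to score higher in the game.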

Difference Between Pearson and Spearman Rank Correlation

Assumptions:

  • Pearson: Assumes a linear relationship; the usual significance test also assumes approximately normally distributed variables.
  • Spearman: Does not assume any specific distribution and requires only that the relationship be monotonic.

Sensitivity to Outliers:

  • Pearson: Sensitive to outliers, as they can significantly affect the correlation coefficient.
  • Spearman: Less sensitive to outliers, as it ranks the data before calculating the correlation.
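A small experiment can illustrate the difference in outlier sensitivity. The data below are hypothetical: y equals x exactly, and then a single observation is replaced by an extreme value.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y equals x exactly, so both coefficients start at 1.
x = np.arange(1, 11, dtype=float)
y_clean = x.copy()

# Same data with the last observation replaced by one extreme value.
y_outlier = y_clean.copy()
y_outlier[-1] = -50.0

pearson_clean, _ = pearsonr(x, y_clean)
spearman_clean, _ = spearmanr(x, y_clean)
pearson_outlier, _ = pearsonr(x, y_outlier)
spearman_outlier, _ = spearmanr(x, y_outlier)

print(f"clean:   Pearson {pearson_clean:.2f}, Spearman {spearman_clean:.2f}")
print(f"outlier: Pearson {pearson_outlier:.2f}, Spearman {spearman_outlier:.2f}")
```

With this particular outlier, Pearson's r turns negative (about -0.39), while Spearman's coefficient stays positive (about 0.45). Spearman is still affected, but only by the outlier's rank, not by how extreme its value is.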

Type of Data:

  • Pearson: Suitable for continuous data.
  • Spearman: Suitable for ordinal data or when the data are ranked.

Correlation vs. Covariance

While both correlation and covariance measure the relationship between two variables, they are distinct concepts:

  • Covariance: Indicates the direction of the linear relationship between variables but not the strength. It is not standardized, meaning the magnitude of covariance depends on the scale of the variables.

\[ \mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]

  • Correlation: A standardized measure that not only indicates the direction but also the strength of the linear relationship between variables. The correlation coefficient is dimensionless, allowing for easy comparison between different data sets.

\[ r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \]

Where:

  • σX and σY are the standard deviations of X and Y, computed with the same n − 1 denominator as the covariance.
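The scale dependence of covariance can be checked by changing units. The sketch below reuses the study-hours data from the Pearson example and expresses the same study time in minutes (the `minutes` variable is added here purely for illustration). `np.cov` uses the n − 1 denominator by default, so the standard deviations are computed with `ddof=1` to match.

```python
import numpy as np

# Study-hours data from the Pearson example.
hours = np.array([2, 3, 4, 5, 6], dtype=float)
score = np.array([65, 70, 75, 80, 85], dtype=float)

# Same study time in different units (illustrative).
minutes = hours * 60

# Sample covariance (n - 1 denominator): changes with the unit of measurement.
cov_hours = np.cov(hours, score)[0, 1]
cov_minutes = np.cov(minutes, score)[0, 1]

# Correlation: identical in both units.
r_hours = np.corrcoef(hours, score)[0, 1]
r_minutes = np.corrcoef(minutes, score)[0, 1]

# Correlation rebuilt from covariance and sample standard deviations.
r_from_cov = cov_hours / (hours.std(ddof=1) * score.std(ddof=1))

print(cov_hours, cov_minutes)
print(round(r_hours, 6), round(r_minutes, 6), round(r_from_cov, 6))
```

The covariance grows by a factor of 60 (12.5 versus 750) just because the unit changed, while the correlation is unaffected.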

Correlation is Not Causation

A critical concept in data analysis is that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. There may be other factors involved, such as:

  • Confounding Variables: A third variable may be influencing both correlated variables.
  • Coincidence: The correlation might be due to chance, especially with small samples or when many pairs of variables are compared.
  • Reverse Causality: The direction of causality might be opposite to what is assumed.

Example:

Consider a scenario where there is a high correlation between ice cream sales and drowning incidents. While these two variables may be correlated, it does not mean that buying ice cream causes drowning. The actual cause might be a third variable — such as warmer weather, which leads to both higher ice cream sales and more people swimming.
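The confounding mechanism in this example can be imitated with simulated data. Everything in the sketch below is synthetic and assumed for illustration: the variable names, the coefficients 2.0 and 1.5, and the noise levels are made up, and neither simulated variable influences the other. The helper `residuals` is also hypothetical; it removes the linear effect of temperature with a simple least-squares line.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic confounder: standardized daily temperature.
temperature = rng.normal(size=n)

# Both outcomes depend on temperature plus independent noise;
# neither depends on the other.
ice_cream = 2.0 * temperature + rng.normal(size=n)
drownings = 1.5 * temperature + rng.normal(size=n)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(y, x):
    """Part of y left over after fitting a straight line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the linear effect of temperature from both.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:         {raw_r:.2f}")
print(f"controlling temperature: {partial_r:.2f}")
```

The raw correlation is strong even though there is no causal link between the two simulated variables, and it largely disappears once temperature is accounted for. Real data rarely make the confounder this easy to identify.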

Conclusion

Understanding correlation and its different types — Pearson and Spearman Rank — allows for better analysis of the relationships between variables in data. It’s essential to choose the right type of correlation based on the data’s nature and distribution. Additionally, distinguishing correlation from covariance helps clarify the relationship’s direction and strength. Finally, always remember that correlation does not imply causation; careful analysis is required to avoid misleading conclusions.

Karthikeyan Dhanakotti | LinkedIn

Written by Karthikeyan Dhanakotti

AI/ML & Data Science Leader @ Microsoft , Mentor/Speaker, AI/ML Enthusiast | Microsoft Certified.
