Correlation: Pearson vs. Spearman Rank Correlation

5 min read · Aug 11, 2024

Introduction

Correlation is a statistical measure of how strongly two variables are related. The two most common coefficients are Pearson’s, which measures linear association, and Spearman’s, which measures monotonic association. Correlation is an essential concept in data analysis, helping us understand the relationship between variables and make informed decisions based on that understanding. However, correlation alone doesn’t imply causality, and it’s crucial to choose the correlation method that suits the nature of the data.

What is Correlation?

Correlation quantifies the degree to which two variables move in relation to each other. A correlation coefficient ranges from -1 to 1, where:

1 indicates a perfect positive linear relationship.
-1 indicates a perfect negative linear relationship.
0 indicates no linear relationship.
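To make the three reference values concrete, here is a minimal sketch in plain Python. The helper name pearson_r and the toy data are assumptions made for illustration. Note the third case: the points follow a parabola exactly, yet the coefficient is 0, because 0 only rules out a linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])     # points on a rising line
r_down = pearson_r(x, [-3 * v + 4 for v in x])  # points on a falling line
r_curve = pearson_r(x, [v * v for v in x])      # symmetric parabola

print(r_up, r_down, r_curve)
```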

Pearson Correlation Coefficient

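The Pearson Correlation Coefficient (also known as Pearson’s r) measures the strength and direction of the linear relationship between two continuous variables. It captures linear relationships only. The coefficient itself can be computed for any paired numeric data, but the usual significance tests and confidence intervals for r assume that the data are approximately bivariate normal.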

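Formula:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

where

Xi and Yi are the individual data points.
X̄ and Ȳ are the means of the data sets.
n is the number of observations.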
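Pearson’s r divides the sum of the products of the deviations from the means by the square root of the product of the two sums of squared deviations. The sketch below walks through those terms one at a time, using five study-hour and exam-score pairs as sample data. The variable names are chosen for illustration only.

```python
import math

hours = [2, 3, 4, 5, 6]
scores = [65, 70, 75, 80, 85]

n = len(hours)
mean_x = sum(hours) / n   # 4.0
mean_y = sum(scores) / n  # 75.0

# Deviations of each observation from its mean
dev_x = [x - mean_x for x in hours]
dev_y = [y - mean_y for y in scores]

# Numerator: sum of the products of paired deviations
numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Denominator: square root of the product of the sums of squared deviations
ss_x = sum(dx * dx for dx in dev_x)
ss_y = sum(dy * dy for dy in dev_y)
r = numerator / math.sqrt(ss_x * ss_y)

print(numerator, ss_x, ss_y, r)
```

Here the numerator is 50 and the denominator is the square root of 10 × 250, which is also 50, so r comes out as 1.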

Example of Pearson Correlation

Imagine we want to examine the relationship between students’ hours of study and their exam scores. We believe there is a linear relationship between the two variables, so Pearson Correlation is suitable.

Data:

Hours of Study | Exam Score
2 | 65
3 | 70
4 | 75
5 | 80
6 | 85

Python Implementation:

import pandas as pd
from scipy.stats import pearsonr

# Data
data_pearson = {
    'Hours_of_Study': [2, 3, 4, 5, 6],
    'Exam_Score':     [65, 70, 75, 80, 85]
}

# Convert to DataFrame
df_pearson = pd.DataFrame(data_pearson)

# Calculate Pearson Correlation
pearson_corr, _ = pearsonr(df_pearson['Hours_of_Study'], df_pearson['Exam_Score'])
print(f"Pearson Correlation: {pearson_corr}")

Output: a coefficient of 1.0 (up to floating-point rounding).

Interpretation: The Pearson Correlation coefficient is 1 because the sample data lie exactly on a straight line (Exam_Score = 5 × Hours_of_Study + 55), which is a perfect positive linear relationship. In this sample, each additional hour of study corresponds to 5 more exam points.

Spearman Rank Correlation Coefficient

The Spearman Rank Correlation Coefficient measures the strength and direction of the monotonic relationship between two variables by comparing the ranks of their values rather than the values themselves. Unlike Pearson, Spearman does not assume that the data are normally distributed, and it can capture non-linear relationships as long as they are monotonic (consistently increasing or consistently decreasing).

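Formula:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

Where:

di is the difference between the ranks of corresponding values in the two data sets.
n is the number of observations.

This shortcut formula is exact only when there are no tied ranks. With ties, Spearman’s coefficient is computed as the Pearson correlation of the ranks.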
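Spearman’s coefficient for untied data is 1 minus six times the sum of squared rank differences, divided by n(n² − 1). The sketch below applies that to five hours-of-study and game-score pairs. The ranks helper is a hypothetical name introduced for this example, and it assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

hours = [2, 3, 4, 5, 6]
game_scores = [20, 45, 40, 70, 65]

rank_h = ranks(hours)        # [1, 2, 3, 4, 5]
rank_g = ranks(game_scores)  # [1, 3, 2, 5, 4]

# Rank differences and their squared sum
d = [a - b for a, b in zip(rank_h, rank_g)]  # [0, -1, 1, -1, 1]
sum_d2 = sum(v * v for v in d)               # 4

n = len(hours)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))    # 1 - 24/120

print(rank_h, rank_g, sum_d2, rho)
```

The two swapped neighbours in the game scores (45 before 40, 70 before 65) are the only rank disagreements, and they bring the coefficient down from 1 to 0.8.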

When to Use Spearman Rank Correlation:

When the data are not normally distributed or contain outliers.
When the relationship between variables is non-linear but monotonic.
Examples: Ranking students by exam scores or correlating the order of products sold with customer satisfaction.
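The "non-linear but monotonic" case is easy to see with a small experiment. In this sketch y is the cube of x, so it always rises with x but not along a straight line. Pearson’s r falls short of 1, while Spearman’s coefficient is exactly 1. The helper names and the cubic toy data are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_rho(xs, ys):
    """Spearman correlation via rank differences; assumes no tied values."""
    def ranks(values):
        order = sorted(values)
        return [order.index(v) + 1 for v in values]
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # strictly increasing, but curved

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson: {r:.3f}, Spearman: {rho:.3f}")
```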

Example of Spearman Rank Correlation

Suppose we want to analyze the relationship between the ranking of students based on their hours of study and their performance in a competitive game where scores do not increase linearly with study time. In this case, Spearman Rank Correlation is more appropriate.

Data:

Hours of Study | Game Score
2 | 20
3 | 45
4 | 40
5 | 70
6 | 65

Python Implementation:

import pandas as pd
from scipy.stats import spearmanr

# Data
data_spearman = {
    'Hours_of_Study': [2,  3,  4,  5,  6],
    'Game_Score':     [20, 45, 40, 70, 65]
}

# Convert to DataFrame
df_spearman = pd.DataFrame(data_spearman)

# Calculate Spearman Rank Correlation
spearman_corr, _ = spearmanr(df_spearman['Hours_of_Study'], df_spearman['Game_Score'])
print(f"Spearman Rank Correlation: {spearman_corr}")

Output: a coefficient of approximately 0.8. The rank differences are 0, -1, 1, -1, 1, so the sum of their squares is 4 and ρ = 1 - (6 × 4) / (5 × 24) = 0.8.

Interpretation: The Spearman Rank Correlation coefficient is 0.8, indicating a strong positive monotonic relationship between hours of study and game scores. The relationship is not perfectly monotonic (the scores 45 and 40, and 70 and 65, appear in reversed order), but the overall trend is that students who study more tend to score higher in the game.

Difference Between Pearson and Spearman Rank Correlation

Assumptions:

Pearson: Assumes a linear relationship; significance tests for r additionally assume approximately normal data.
Spearman: Does not assume any specific distribution and only requires the relationship to be monotonic.

Sensitivity to Outliers:

Pearson: Sensitive to outliers, as they can significantly affect the correlation coefficient.
Spearman: Less sensitive to outliers, as it ranks the data before calculating the correlation.
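A small experiment illustrates the difference in sensitivity. The sketch below takes ten points on a straight line and replaces one middle value with an extreme one. The data and helper names are made up for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_rho(xs, ys):
    """Spearman correlation via rank differences; assumes no tied values."""
    def ranks(values):
        order = sorted(values)
        return [order.index(v) + 1 for v in values]
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = list(range(1, 11))
y_clean = list(x)        # perfectly linear: y equals x
y_outlier = list(x)
y_outlier[4] = 50        # one extreme value in the middle of the series

r_clean, rho_clean = pearson_r(x, y_clean), spearman_rho(x, y_clean)
r_out, rho_out = pearson_r(x, y_outlier), spearman_rho(x, y_outlier)

print(f"Clean data:   Pearson {r_clean:.3f}, Spearman {rho_clean:.3f}")
print(f"With outlier: Pearson {r_out:.3f}, Spearman {rho_out:.3f}")
```

With the outlier, Pearson’s r drops to roughly 0.15 while Spearman’s coefficient stays near 0.82. The extreme value can only move to the top rank, so its influence is bounded.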

Type of Data:

Pearson: Suitable for continuous data.
Spearman: Suitable for ordinal data or when the data are ranked.

Correlation vs. Covariance

While both correlation and covariance measure the relationship between two variables, they are distinct concepts:

Covariance: Indicates the direction of the linear relationship between variables but not the strength. It is not standardized, meaning the magnitude of covariance depends on the scale of the variables.

Correlation: A standardized measure that not only indicates the direction but also the strength of the linear relationship between variables. The correlation coefficient is dimensionless, allowing for easy comparison between different data sets.

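Formula:

$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

Where:

Cov(X, Y) is the covariance of X and Y.
σX and σY are the standard deviations of X and Y.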
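The scale dependence of covariance is easy to check numerically. In this sketch the same study times are expressed first in hours and then in minutes. The covariance grows by a factor of 60, while the correlation, which divides the covariance by both standard deviations, does not change. The function names are illustrative, and the sample (n - 1) form of the covariance is an assumption of this example.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

hours = [2, 3, 4, 5, 6]
game_scores = [20, 45, 40, 70, 65]
minutes = [60 * h for h in hours]  # same durations, different unit

cov_hours = sample_covariance(hours, game_scores)
cov_minutes = sample_covariance(minutes, game_scores)
corr_hours = correlation(hours, game_scores)
corr_minutes = correlation(minutes, game_scores)

print(f"Covariance:  {cov_hours} (hours) vs {cov_minutes} (minutes)")
print(f"Correlation: {corr_hours:.4f} (hours) vs {corr_minutes:.4f} (minutes)")
```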

Correlation is Not Causation

A critical concept in data analysis is that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. There may be other factors involved, such as:

Confounding Variables: A third variable may be influencing both correlated variables.
Coincidence: The correlation might be due to chance, especially with small samples or when many pairs of variables are compared in a large dataset.
Reverse Causality: The direction of causality might be opposite to what is assumed.
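The coincidence case can be simulated. The sketch below draws 200 pairs of completely unrelated random variables, each with only 10 observations, and reports the strongest correlation found. The sample sizes and the seed are arbitrary choices for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_obs = 10               # observations per variable
n_pairs = 200            # number of unrelated variable pairs screened

strongest = 0.0
for _ in range(n_pairs):
    xs = [rng.gauss(0, 1) for _ in range(n_obs)]
    ys = [rng.gauss(0, 1) for _ in range(n_obs)]  # independent of xs by construction
    strongest = max(strongest, abs(pearson_r(xs, ys)))

print(f"Strongest |r| among {n_pairs} unrelated pairs: {strongest:.2f}")
```

Even though no pair is truly related, screening this many pairs with so few observations almost always turns up at least one coefficient that looks substantial.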

Example:

Consider a scenario where there is a high correlation between ice cream sales and drowning incidents. While these two variables may be correlated, it does not mean that buying ice cream causes drowning. The more plausible explanation is a third variable, warmer weather, which leads to both higher ice cream sales and more people swimming.
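The ice cream scenario can be mimicked with synthetic data. In the sketch below, temperature drives both ice cream sales and drownings, and the two outcomes never influence each other. All numbers (temperature range, slopes, noise levels) are invented for illustration and are not real measurements. Removing the temperature effect from each series with a simple least-squares line leaves residuals that are essentially uncorrelated.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def residuals(ys, xs):
    """Residuals of ys after fitting a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature is the only common driver
temperature = [rng.uniform(10, 35) for _ in range(1000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
partial_r = pearson_r(residuals(ice_cream, temperature),
                      residuals(drownings, temperature))

print(f"Raw correlation: {raw_r:.2f}")
print(f"After removing the temperature effect: {partial_r:.2f}")
```

The raw correlation is high, yet it largely disappears once the shared driver is accounted for, which is the signature of a confounding variable.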

Conclusion

Understanding correlation and its different types — Pearson and Spearman Rank — allows for better analysis of the relationships between variables in data. It’s essential to choose the right type of correlation based on the data’s nature and distribution. Additionally, distinguishing correlation from covariance helps clarify the relationship’s direction and strength. Finally, always remember that correlation does not imply causation; careful analysis is required to avoid misleading conclusions.

Karthikeyan Dhanakotti | LinkedIn
