Evaluating Distributional Differences: One-Sample vs Two-Sample Kolmogorov-Smirnov Tests
Kolmogorov-Smirnov Test (KS Test)
The Kolmogorov-Smirnov (KS) test is a non-parametric test used in statistics to compare two distributions. It measures the distance between the empirical distribution functions of two samples or between a sample distribution and a reference distribution. Here’s a detailed explanation:
Purpose
The KS test is used to determine if:
- A sample comes from a population with a specific distribution (one-sample KS test).
- Two samples come from the same distribution (two-sample KS test).
Key Concept
The test is based on the empirical distribution function (EDF). The EDF Fn(x) of a sample is the proportion of sample points less than or equal to x. For a continuous distribution with cumulative distribution function (CDF) F(x), the KS statistic quantifies the maximum distance between Fn(x) and F(x).
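As a quick illustration of the definition, the EDF at a point can be computed directly (a minimal sketch using NumPy; the function name edf is just for illustration):

```python
import numpy as np

def edf(sample, x):
    """EDF at x: the proportion of sample points less than or equal to x."""
    return np.mean(np.asarray(sample) <= x)

sample = np.array([0.2, 0.5, 0.5, 0.9])
print(edf(sample, 0.5))  # 3 of the 4 points are <= 0.5, so this prints 0.75
```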
KS Statistic
For the one-sample KS test, the KS statistic is defined as:
Dn = sup_x |Fn(x) − F(x)|
where sup denotes the supremum (in practice, the maximum) taken over all values of x.
For the two-sample KS test, the KS statistic is defined as:
Dn,m = sup_x |Fn(x) − Gm(x)|
where Fn(x) and Gm(x) are the empirical distribution functions of the first and second sample, of sizes n and m respectively.
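To make the definition concrete, the two-sample statistic can be computed directly from the EDFs and checked against SciPy (a sketch; the sample sizes and variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=200)
b = rng.normal(loc=0.5, scale=1, size=300)

# The supremum of |Fn(x) - Gm(x)| is attained at an observed data point,
# so it suffices to evaluate both EDFs on the pooled sample.
grid = np.concatenate([a, b])
Fn = np.searchsorted(np.sort(a), grid, side='right') / len(a)
Gm = np.searchsorted(np.sort(b), grid, side='right') / len(b)
d_manual = np.max(np.abs(Fn - Gm))

d_scipy = stats.ks_2samp(a, b).statistic
print(d_manual, d_scipy)  # the two values coincide
```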
Hypotheses
One-sample KS test:
Null hypothesis (Ho): The sample comes from the reference distribution.
Alternative hypothesis (Ha): The sample does not come from the reference distribution.
Two-sample KS test:
Null hypothesis (Ho): The two samples come from the same distribution.
Alternative hypothesis (Ha): The two samples come from different distributions.
P-value
The p-value for the KS test can be calculated using the distribution of the KS statistic under the null hypothesis. If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected.
For large samples, the null hypothesis is rejected at level α if
Dn,m > c(α) · √((n + m) / (n · m)),
where n and m are the sizes of the first and second sample respectively, and c(α) = √(−ln(α/2) / 2). For the most common significance levels: c(0.10) ≈ 1.22, c(0.05) ≈ 1.36, and c(0.01) ≈ 1.63.
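This rejection rule is straightforward to compute; the sketch below uses the standard asymptotic approximation c(α) = √(−ln(α/2)/2) (the function name is illustrative):

```python
import numpy as np

def ks_critical_value(alpha, n, m):
    """Large-sample critical value for the two-sample KS statistic."""
    c_alpha = np.sqrt(-np.log(alpha / 2) / 2)
    return c_alpha * np.sqrt((n + m) / (n * m))

# Reject H0 at the 5% level if D exceeds this threshold:
print(round(ks_critical_value(0.05, 1000, 1000), 4))  # ≈ 0.0607
```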
Uses
Goodness-of-Fit Test: To check if a sample follows a specific distribution (e.g., normal, exponential).
Comparing Two Samples: To test if two independent samples come from the same distribution.
Model Validation: To validate assumptions about data distribution in various statistical models.
Hypothesis Testing: To perform non-parametric hypothesis tests without assuming specific distributions.
Advantages
Non-parametric: No assumption about the underlying distribution.
Applicable to continuous data, with conservative variants available for discrete data.
Sensitive to differences in both location and shape of the empirical cumulative distribution functions.
Limitations
Less powerful than parametric tests (e.g., the t-test) when their distributional assumptions actually hold.
For large samples, even small, practically unimportant differences can lead to rejection of the null hypothesis.
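The second limitation is easy to demonstrate: with a practically negligible mean shift, the test is usually inconclusive at a modest sample size but flags a highly significant difference once the samples are very large (a sketch; the shift and sample sizes are arbitrary illustration values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
shift = 0.03  # a practically negligible difference in means

# Modest samples: the test typically cannot detect a shift this small.
small_p = stats.ks_2samp(rng.normal(0, 1, 200), rng.normal(shift, 1, 200)).pvalue

# Very large samples: the same tiny shift becomes highly significant.
large_p = stats.ks_2samp(rng.normal(0, 1, 200_000), rng.normal(shift, 1, 200_000)).pvalue

print(f"n=200: p = {small_p:.3f};  n=200000: p = {large_p:.3g}")
```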
Python Code for One-sample KS Test
# One-sample KS Test
# import required libraries
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate a sample data
np.random.seed(0)
sample_data = np.random.normal(loc=0, scale=1, size=1000)
# Perform the one-sample KS test
ks_stats, p_value = stats.kstest(sample_data, 'norm', args=(0, 1))
print(f"One-sample KS Test:")
print(f"KS Statistic: {ks_stats}")
print(f"P-value: {p_value}")
# Plot the empirical CDF and the CDF of the normal distribution for visualization
plt.figure(figsize=(10, 6))
x_sorted = np.sort(sample_data)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)
plt.plot(x_sorted, ecdf, marker='.', linestyle='none', label='Empirical CDF')
plt.plot(x_sorted, stats.norm.cdf(x_sorted, loc=0, scale=1), label='Normal CDF')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('One-sample KS Test')
plt.legend()
plt.show()
Explanation:
- Generate a sample of 1000 data points from a normal distribution with mean 0 and standard deviation 1.
- Use stats.kstest to test if the sample comes from a normal distribution.
- Plot the empirical CDF of the sample and the CDF of the normal distribution for visual comparison.
Results:
- KS Statistic: 0.03737519429804048
- P-value: 0.11930823166569182
Conclusion:
- Null Hypothesis (Ho): The sample comes from a normal distribution with mean 0 and standard deviation 1.
- Alternative Hypothesis (Ha): The sample does not come from a normal distribution with mean 0 and standard deviation 1.
- P-value: 0.11930823166569182 (greater than 0.05).
Since the p-value (0.11930823166569182) is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means there is not enough evidence to conclude that the sample data does not come from a normal distribution with mean 0 and standard deviation 1. The sample is therefore consistent with the specified normal distribution.
Python Code for Two-sample KS Test
# Two-sample KS Test
# import required libraries
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate two sample datasets
np.random.seed(0)
sample_data1 = np.random.normal(loc=0, scale=1, size=1000)
sample_data2 = np.random.normal(loc=0.5, scale=1, size=1000)
# Perform the two-sample KS test
ks_stats, p_value = stats.ks_2samp(sample_data1, sample_data2)
print(f"Two-sample KS Test:")
print(f"KS Statistic: {ks_stats}")
print(f"P-value: {p_value}")
# Plot the empirical CDFs of the two samples for visualization
plt.figure(figsize=(10, 6))
x1_sorted = np.sort(sample_data1)
ecdf1 = np.arange(1, len(x1_sorted) + 1) / len(x1_sorted)
x2_sorted = np.sort(sample_data2)
ecdf2 = np.arange(1, len(x2_sorted) + 1) / len(x2_sorted)
plt.plot(x1_sorted, ecdf1, marker='.', linestyle='none', label='Sample 1 Empirical CDF')
plt.plot(x2_sorted, ecdf2, marker='.', linestyle='none', label='Sample 2 Empirical CDF')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Two-sample KS Test')
plt.legend()
plt.show()
Explanation:
- Generate two samples of 1000 data points each from normal distributions with different means.
- Use stats.ks_2samp to test if the two samples come from the same distribution.
- Plot the empirical CDFs of both samples for visual comparison.
Output:
- KS Statistic: The maximum distance between the empirical CDFs.
- P-value: The probability of observing such a distance (or larger) under the null hypothesis.
Results:
- KS Statistic: 0.248
- P-value: 2.104700973377179e-27
Conclusion:
- Null Hypothesis (Ho): The two samples come from the same distribution.
- Alternative Hypothesis (Ha): The two samples come from different distributions.
- P-value: 2.104700973377179e-27 (much smaller than 0.05).
Since the p-value (2.104700973377179e-27) is significantly less than the typical significance level of 0.05, we reject the null hypothesis. This indicates that there is strong evidence to conclude that the two samples come from different distributions. The significant difference in the empirical cumulative distribution functions (ECDFs) of the two samples suggests they do not share the same underlying distribution.