Interpreting Correlations¶

This is a learning exercise that teaches you how to analyze correlations within mixed data sets. In this exercise we will study the following:

Types of variables as they relate to statistics.
What correlation metrics are common.
Interpretation of correlation metrics.

Variable Types¶

1. Numerical¶

Numerical values fall within two categories:

Discrete - the values are countable and distinct, usually integers. An example is a binary value of 0 or 1.
Continuous - the values can fall anywhere within a range and can be measured to arbitrary precision. For example, age measured across a population is continuous. To examine subpopulations within such a data set, you can create custom ranges (such as ages 1 through 10) and look at the distribution within each.
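The custom age ranges described above can be sketched with pandas. This is a minimal illustration on synthetic data, not part of the original exercise: the seed, the ten-year bin width, and the names `ages`, `bin_edges`, and `age_groups` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic continuous ages in [0, 100); the seed is arbitrary
rng = np.random.default_rng(0)
ages = pd.Series(rng.uniform(0, 100, size=1000), name='age')

# Ten-year ranges: [0, 10), [10, 20), ..., [90, 100)
bin_edges = np.arange(0, 101, 10)
age_groups = pd.cut(ages, bins=bin_edges, right=False)

# Count how many samples fall into each subpopulation
group_counts = age_groups.value_counts().sort_index()
print(group_counts)
```

Changing `bin_edges` is enough to look at a different set of subpopulations.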

Here is some code to look at how these values look given their distribution:

%matplotlib inline

import numpy as np
import seaborn as sns
import pandas as pd

N = 1000

# Generate N random binary values
random_binary = np.random.choice([0, 1], size=(N,))

# Generate N random integer ages from 0 to 99 (randint's upper bound is exclusive)
random_ages = np.random.randint(0, 100, N)

# Plot binary distribution (histplot replaces distplot, deprecated since seaborn 0.11)
ax = sns.histplot(random_binary)
ax.set_title('Binary Distribution')
print('mean', np.mean(random_binary))

mean 0.514

The binary values have a mean of about 0.5. As you draw more samples, the sample mean converges toward the expected value of 0.5 (see the Law of Large Numbers). The samples consist of a nearly equal number of 0's and 1's.
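As the sample size grows, the sample mean should settle near 0.5. The following is a small sketch of that behavior, assuming a fixed seed and three arbitrarily chosen sample sizes; neither appears in the original exercise.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for repeatability
sample_sizes = [10, 1_000, 100_000]
deviations = []

for n in sample_sizes:
    draws = rng.integers(0, 2, size=n)  # n values drawn from {0, 1}
    sample_mean = draws.mean()
    # Distance between the sample mean and the expected value 0.5
    deviations.append(abs(sample_mean - 0.5))
    print(n, sample_mean)
```

The deviation for the largest sample should be far smaller than what a handful of draws typically gives, although any single small sample can land close to 0.5 by chance.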

# Plot age distribution (histplot replaces distplot, deprecated since seaborn 0.11)
ax = sns.histplot(random_ages)
ax.set_title('Continuous Distribution\nAges 0 to 99')
print('mean', np.mean(random_ages))

mean 50.326

Because the ages are drawn uniformly from 0 to 99, the mean converges to about 50 (49.5 in expectation). However, the distribution looks much different from the binary one: the values spread across the whole range instead of concentrating on two points.

2. Categorical¶

Categorical values are labels used only to group samples. Any numbers assigned to them carry no mathematical meaning. For example, coding male as 1 and female as 2 does not imply any order or magnitude.
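One way to keep such codes from being treated as numbers is to store them as a pandas categorical. This is a minimal sketch; the five sample codes and the names `sex_codes` and `sex` are made up for illustration.

```python
import pandas as pd

# Hypothetical coding: 1 = male, 2 = female
sex_codes = pd.Series([1, 2, 2, 1, 2])

# Map the codes to labels and store them as an unordered categorical
sex = sex_codes.map({1: 'male', 2: 'female'}).astype('category')

print(sex.cat.categories.tolist())  # the distinct groups
print(sex.value_counts().to_dict())  # group sizes
print(sex.cat.ordered)  # whether the categories have an order
```

Counting and grouping still work, while taking a mean of the labels is no longer possible by accident.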

3. Ordinal¶

Ordinal values are categorical values whose order carries meaning, although the spacing between ranks is not necessarily equal. They are essentially a mix of numerical and categorical values. An example is a ranking system.

For example, in a race, prizes are awarded by finishing position:

1st place
2nd place
3rd place

or

A pain chart provided at a hospital in which 1 through 10 expresses the amount of pain you are experiencing.
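The race placements can be represented as an ordered categorical in pandas, which allows comparisons but not arithmetic. This is a sketch under assumed data: the four finishes and the names `placement_order` and `finishes` are hypothetical.

```python
import pandas as pd

# Categories listed from worst to best so comparisons follow the ranking
placement_order = ['3rd place', '2nd place', '1st place']
ordinal_type = pd.CategoricalDtype(categories=placement_order, ordered=True)

finishes = pd.Series(['2nd place', '1st place', '3rd place', '2nd place'])
finishes = finishes.astype(ordinal_type)

best = finishes.max()  # best placement present in the data
beat_third = (finishes > '3rd place').sum()  # finishes better than 3rd
print(best, beat_third)
```

The order is what matters here: the data say that 1st beats 2nd, but not by how much.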

Common Correlation Metrics¶

1. Pearson¶

The Pearson Correlation Coefficient measures the strength of the linear relationship between two variables, X and Y. The values range from -1 to +1. When the value is close to -1, the variables are negatively correlated. When the value is close to +1, the variables are positively correlated. A value close to 0 means that there is no linear relationship between X and Y, although a nonlinear one may still exist.
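The three cases can be reproduced on synthetic data with `DataFrame.corr`. This is a minimal sketch; the column names and the functions used to build them are assumptions. A quadratic column is included to show that a coefficient near 0 rules out only a linear relationship.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)  # symmetric around 0
df = pd.DataFrame({
    'x': x,
    'positive': 2 * x + 1,  # perfectly linear, increasing
    'negative': -3 * x,     # perfectly linear, decreasing
    'curved': x ** 2,       # fully determined by x, but not linearly
})

# Pairwise Pearson coefficients between all columns
pearson = df.corr(method='pearson')
print(pearson.round(3))
```

The `curved` column depends entirely on `x`, yet its coefficient with `x` is essentially 0 because the relationship is symmetric rather than linear.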

2. Spearman¶

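The Spearman Rank Order Correlation measures the strength of the monotonic relationship between ordinal, interval or ratio variables. It is computed on the ranks of the values rather than the values themselves, so it captures whether the variables tend to increase or decrease together even when the relationship is not linear.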
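The difference from Pearson shows up on data that is monotonic but not linear. The sketch below assumes an exponential curve as the example relationship; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 21)
df = pd.DataFrame({'x': x, 'y': np.exp(x / 4.0)})  # always increasing, strongly curved

# Spearman works on ranks; Pearson works on the raw values
spearman = df.corr(method='spearman').loc['x', 'y']
pearson = df.corr(method='pearson').loc['x', 'y']
print('spearman', spearman)
print('pearson', pearson)
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman reports 1, while Pearson is lower because the points do not lie on a straight line.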

3. Biserial¶

A Point-biserial correlation measures the relationship between a binary variable and a continuous variable. It is the same computation as the Pearson correlation, applied to the case where one of the variables is binary.
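The equivalence with Pearson can be checked numerically. The sketch below uses synthetic data and the textbook point-biserial formula based on the two group means; the seed, the group effect of 10, and the names `binary_flag` and `measurement` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
binary_flag = rng.integers(0, 2, size=200)  # hypothetical 0/1 variable
measurement = 40 + 10 * binary_flag + rng.normal(0, 5, size=200)  # hypothetical continuous variable

# Point-biserial formula: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2),
# where s_n is the population standard deviation (ddof=0)
m1 = measurement[binary_flag == 1].mean()
m0 = measurement[binary_flag == 0].mean()
n1 = (binary_flag == 1).sum()
n0 = (binary_flag == 0).sum()
n = len(binary_flag)
r_pb = (m1 - m0) / measurement.std(ddof=0) * np.sqrt(n1 * n0 / n**2)

# Plain Pearson coefficient on the same two variables
r_pearson = np.corrcoef(binary_flag, measurement)[0, 1]
print(r_pb, r_pearson)
```

Both values agree up to floating-point error, which is why a standard Pearson routine can be used when one variable is coded as 0/1.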

	ages	wages
ages	1.00000	0.97941
wages	0.97941	1.00000

	x	y
x	1.0	1.0
y	1.0	1.0

	x	y
x	1.0	-1.0
y	-1.0	1.0

	x	y
x	1.000000	-0.172841
y	-0.172841	1.000000

	age	obese
age	1.00000	0.80615
obese	0.80615	1.00000