This is a learning exercise that teaches you how to analyze correlations within mixed data sets. In this exercise we will study the following:
Numerical values fall within two categories:
The code below generates sample values of each kind and plots their distributions:
%matplotlib inline
import numpy as np
import seaborn as sns
import pandas as pd
N = 1000
# Generate N random binary values
random_binary = np.random.choice([0, 1], size=(N,))
# Generate N random integer ages from 0 to 99 (the upper bound of randint is exclusive)
random_ages = np.random.randint(0, 100, N)
# Plot binary distribution
# histplot replaces distplot, which is deprecated in recent seaborn releases
ax = sns.histplot(random_binary)
ax.set_title('Binary Distribution')
print('mean', np.mean(random_binary))
The binary distribution has a mean of around 0.5. As you draw more samples, the sample mean converges toward the true mean of 0.5 (see the Law of Large Numbers). The samples consist of a nearly equal number of 0s and 1s.
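To make the convergence concrete, here is a minimal sketch that draws binary samples of increasing size and prints each sample mean. The seed and the particular sample sizes are arbitrary choices for illustration, not part of the original exercise.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Illustrative sample sizes
sample_sizes = [10, 100, 1_000, 100_000]

# Draw n values from {0, 1} and record the sample mean for each n
means = {n: rng.integers(0, 2, size=n).mean() for n in sample_sizes}

for n, m in means.items():
    print(f'N={n:>7}  mean={m:.4f}')
```

The small samples can land noticeably away from 0.5, while the largest one should sit very close to it.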
# Plot age distribution
ax = sns.histplot(random_ages)
ax.set_title('Continuous Distribution\nAges 0 to 99')
print('mean', np.mean(random_ages))
Due to the range of values that we chose, the mean will converge to around 50. However, the distribution looks much different from the binary distribution: the ages are spread across 100 possible values rather than concentrated on two.
Categorical (nominal) values are used only to group samples. The numbers carry no mathematical meaning. For example, coding male as 1 and female as 2 does not imply any order or magnitude.
Ordinal values are categorical values whose order carries meaning. They are essentially a mix of numerical and categorical values. A ranking system is one example.
In a race, for example, prizes are awarded according to finishing position: first place, second place, third place.
Another example is a pain chart provided at a hospital, on which a rating from 1 through 10 expresses the amount of pain you are experiencing.
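The difference between unordered categories and ordinal ones can be sketched with pandas categoricals. The example values and the variable names (`sex`, `places`) are made up for illustration.

```python
import pandas as pd

# Nominal: labels only, with no meaningful order (hypothetical values)
sex = pd.Series(pd.Categorical(['male', 'female', 'female', 'male']))

# Ordinal: labels with a meaningful order, listed from worst to best
places = pd.Series(pd.Categorical(
    ['second', 'first', 'third', 'first'],
    categories=['third', 'second', 'first'],
    ordered=True,
))

print(sex.cat.ordered)     # False: comparisons such as < or > are not defined
print(places.cat.ordered)  # True: comparisons follow the category order

# Ordered categories support max/min and element-wise comparison
best = places.max()
ahead_of_third = (places > 'third').tolist()
print(best, ahead_of_third)
```

Because `places` is ordered, asking for the best finish or for everyone who placed ahead of third is well defined. The same questions have no meaning for `sex`.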
The Pearson Correlation Coefficient measures the strength of the linear relationship between two variables, X and Y. Its value ranges from -1 to +1. The closer the value is to -1, the stronger the negative correlation. The closer the value is to +1, the stronger the positive correlation. A value close to 0 means that there is little or no linear relationship between X and Y.
Below we compare age to wages using values chosen to produce a strong positive correlation.
ages = [18, 18, 25, 27, 30, 38]
wages = [18000, 24000, 55000, 61000, 65000, 88000]
df = pd.DataFrame({'ages': ages, 'wages': wages})
# Recent seaborn versions require x and y to be passed as keywords
sns.lmplot(x='ages', y='wages', data=df)
df.corr()
You can see that these variables have a strong positive correlation. The coefficient is very close to 1.
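To show where that number comes from, here is a small sketch that computes the Pearson coefficient for the same ages and wages directly from its definition (the covariance divided by the product of the standard deviations) and compares it with NumPy's result. The variable names `r_manual` and `r_numpy` are introduced here for illustration.

```python
import numpy as np

ages = np.array([18, 18, 25, 27, 30, 38], dtype=float)
wages = np.array([18000, 24000, 55000, 61000, 65000, 88000], dtype=float)

# Deviations of each value from its mean
dx = ages - ages.mean()
dy = wages - wages.mean()

# Pearson r: sum of co-deviations over the root of the product of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Off-diagonal entry of NumPy's correlation matrix, for comparison
r_numpy = np.corrcoef(ages, wages)[0, 1]

print(r_manual, r_numpy)
```

Both values should agree up to floating-point error, and they match the off-diagonal entries that `df.corr()` reports.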
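The Spearman Rank Order Correlation measures the strength and direction of a monotonic relationship between two ordinal, interval, or ratio variables. A relationship is monotonic when, as one variable increases, the other consistently increases or consistently decreases, though not necessarily at a constant rate.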
x = np.arange(0, 100)
x2 = np.arange(1, 5)
y = (1/4) * (x**2)
y2 = np.exp(-x2)
y3 = np.sin(x / 5)
# Plot increasing relation
df = pd.DataFrame({'x': x, 'y': y})
# lmplot returns a FacetGrid; sns.plt is no longer available, so set the title on its Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Monotonically Increasing')
df.corr(method='spearman')
# Plot decreasing relation
df = pd.DataFrame({'x': x2, 'y': y2})
# lmplot returns a FacetGrid; sns.plt is no longer available, so set the title on its Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Monotonically Decreasing')
df.corr(method='spearman')
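# Plot a non-monotonic relation
df = pd.DataFrame({'x': x, 'y': y3})
# lmplot returns a FacetGrid; sns.plt is no longer available, so set the title on its Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Not Monotonic')
df.corr(method='spearman')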
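One way to see what the Spearman coefficient measures is to note that it equals the Pearson coefficient computed on the ranks of the values. The sketch below checks this on the non-monotonic sine data and on the monotonically increasing quadratic. The names `df_sine`, `df_quad`, `rho_builtin`, `rho_from_ranks` and `rho_quad` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 100)
df_sine = pd.DataFrame({'x': x, 'y': np.sin(x / 5)})
df_quad = pd.DataFrame({'x': x, 'y': x ** 2 / 4})

# Built-in Spearman correlation
rho_builtin = df_sine.corr(method='spearman').loc['x', 'y']

# Same quantity: rank each column, then take the ordinary Pearson correlation
rho_from_ranks = df_sine.rank().corr(method='pearson').loc['x', 'y']

# For a strictly increasing relation the ranks are identical
rho_quad = df_quad.corr(method='spearman').loc['x', 'y']

print(rho_builtin, rho_from_ranks, rho_quad)
```

The quadratic is not linear, yet its Spearman coefficient is 1 because only the ordering of the values matters.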
A point-biserial correlation measures the relationship between a binary variable and a continuous variable. It is the same computation as the Pearson correlation, applied with the binary variable coded as 0 and 1.
Let's create a sample in which older individuals are more likely to be obese.
# Create some values
obese = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
ages = [10, 20, 30, 35, 40, 45, 50, 55, 60, 70]
df = pd.DataFrame({'obese': obese, 'age': ages})
df.corr()
df.plot.scatter('age', 'obese')
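As a check on the claim that the point-biserial correlation is just a Pearson correlation, the sketch below computes it from the usual group-means formula, r_pb = (M1 - M0) / s * sqrt(n0 * n1 / n^2), where s is the population standard deviation of the continuous variable. It then compares the result with NumPy's Pearson coefficient. The variable names are introduced here for illustration.

```python
import numpy as np

obese = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
ages = np.array([10, 20, 30, 35, 40, 45, 50, 55, 60, 70], dtype=float)

# Split the continuous variable by the binary group
group0 = ages[obese == 0]
group1 = ages[obese == 1]
n0, n1, n = len(group0), len(group1), len(ages)

# Point-biserial formula, using the population standard deviation (ddof=0)
r_pb = (group1.mean() - group0.mean()) / ages.std(ddof=0) * np.sqrt(n0 * n1 / n ** 2)

# Ordinary Pearson correlation on the 0/1 coding, for comparison
r_pearson = np.corrcoef(obese, ages)[0, 1]

print(r_pb, r_pearson)
```

The two values should agree up to floating-point error, and they match the off-diagonal entries of `df.corr()` above.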