This is a learning exercise that teaches you how to analyze correlations within mixed data sets. In this exercise we will study the following:

- Types of variables as they relate to statistics.
- What correlation metrics are common.
- Interpretation of correlation metrics.

Numerical values fall within two categories:

- Discrete - the values are countable, restricted to a fixed set, and typically integers. An example is a binary value of 0 or 1.
- Continuous - the values can fall anywhere within a range, so you can divide them into custom scales to look at subpopulations. For example, age measured across a population gives continuous values. To examine subpopulations within that data set, you can create custom ranges (such as ages 1 through 10) and look at the distribution within each.

**Here is some code to show what these values look like given their distribution:**

In [19]:

```
%matplotlib inline
import numpy as np
import seaborn as sns
import pandas as pd
```

In [14]:

```
N = 1000
# Generate N random binary values
random_binary = np.random.choice([0, 1], size=(N,))
# Generate N random ages from 0 to 100 (randint's upper bound is exclusive)
random_ages = np.random.randint(0, 101, N)
```

In [3]:

```
# Plot binary distribution
# distplot is deprecated in seaborn 0.11+; histplot is its replacement
ax = sns.histplot(random_binary, discrete=True)
ax.set_title('Binary Distribution')
print('mean', np.mean(random_binary))
```

In [16]:

```
# Plot age distribution
# distplot is deprecated in seaborn 0.11+; histplot with kde=True is the closest match
ax = sns.histplot(random_ages, kde=True)
ax.set_title('Continuous Distribution\nAges 0 to 100')
print('mean', np.mean(random_ages))
```
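The custom age ranges described earlier for continuous values can be sketched with `pd.cut`. This is a minimal illustration, not part of the original notebook: the seed, the ten-year bin width, and the names `bin_edges`, `age_groups` and `group_counts` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Fixed seed (an arbitrary choice) so reruns give the same sample
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 101, size=1000), name='age')

# Custom ranges for integer ages: 0-10, 11-20, ..., 91-100
bin_edges = list(range(0, 101, 10))
age_groups = pd.cut(ages, bins=bin_edges, include_lowest=True)

# Count how many samples fall into each range
group_counts = age_groups.value_counts().sort_index()
print(group_counts)
```

Each row of `group_counts` is one subpopulation, so the same grouping could be reused to compare any other column across age ranges.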

Categorical (nominal) values are numbers used only to group samples. They carry no mathematical meaning. For example, coding male as 1 and female as 2 does not imply any order or magnitude.
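A short sketch of why such grouping codes should not be treated as quantities. The sample codes and the names `codes`, `labels` and `counts` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: 1 = male, 2 = female (codes are labels, not quantities)
codes = pd.Series([1, 2, 2, 1, 2, 1, 1, 2])

# The arithmetic mean of the codes runs, but the result has no real meaning
meaningless_mean = codes.mean()

# Treating the codes as categories and counting group sizes is meaningful
labels = codes.map({1: 'male', 2: 'female'}).astype('category')
counts = labels.value_counts()
print(counts)
```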

Ordinal values are categorical values whose order carries meaning. They are essentially a mix of numerical and categorical values. A ranking system is one example.

For a race, you can think of the following places, for which individuals are given prizes:

- 1st place
- 2nd place
- 3rd place

or

a pain chart provided at a hospital, in which 1 through 10 expresses the amount of pain you are experiencing.
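One way to represent ranked values like race places in pandas is an ordered categorical. This is a sketch under assumed data: the finishing results and the names `places`, `finishes`, `best` and `top_two` are hypothetical.

```python
import pandas as pd

# Ordered categories: '1st' < '2nd' < '3rd'
places = pd.Categorical(
    ['2nd', '1st', '3rd', '1st', '2nd'],
    categories=['1st', '2nd', '3rd'],
    ordered=True,
)
finishes = pd.Series(places)

# Order-aware operations work because the categories are ordered
best = finishes.min()
top_two = finishes <= '2nd'

# Integer codes preserve the ranking, but the gaps between them are not measured
rank_codes = finishes.cat.codes.tolist()
print(best, top_two.tolist(), rank_codes)
```

The codes keep the order of the places, yet the distance between 1st and 2nd is not necessarily the same as between 2nd and 3rd, which is what separates ordinal values from true numerical ones.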

The Pearson correlation coefficient measures the strength of the linear relationship between two variables, X and Y. Its values range from -1 to +1. A value close to -1 means the variables are negatively correlated, and a value close to +1 means they are positively correlated. A value close to 0 means there is little or no linear relationship between X and Y.

Below we compare age to wages using values that produce a strong positive correlation.

In [35]:

```
ages = [18, 18, 25, 27, 30, 38]
wages = [18000, 24000, 55000, 61000, 65000, 88000]
df = pd.DataFrame({'ages': ages, 'wages': wages})
```

In [30]:

```
# Recent seaborn versions require x and y as keyword arguments
sns.lmplot(x='ages', y='wages', data=df)
```

Out[30]:

In [31]:

```
df.corr()
```

Out[31]:

You can see that these variables have a strong positive correlation: the coefficient is very close to 1.
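To see where that number comes from, the coefficient can be computed by hand as the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The sketch below reuses the same ages and wages; the names `age_dev`, `wage_dev`, `r_manual` and `r_numpy` are introduced here for illustration.

```python
import numpy as np

ages = np.array([18, 18, 25, 27, 30, 38], dtype=float)
wages = np.array([18000, 24000, 55000, 61000, 65000, 88000], dtype=float)

# Deviations of each value from its own mean
age_dev = ages - ages.mean()
wage_dev = wages - wages.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (age_dev * wage_dev).sum() / np.sqrt(
    (age_dev ** 2).sum() * (wage_dev ** 2).sum()
)

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(ages, wages)[0, 1]
print(r_manual, r_numpy)
```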

The Spearman rank-order correlation measures the relationship between ordinal, interval or ratio variables. It assesses how well the relationship can be described by a monotonic function: a value near +1 means the variables increase together, and a value near -1 means one decreases as the other increases.

In [44]:

```
x = np.arange(0, 100)
x2 = np.arange(1, 5)
y = (1/4) * (x**2)
y2 = np.exp(-x2)
y3 = np.sin(x / 5)
```

In [49]:

```
# Plot increasing relation
df = pd.DataFrame({'x': x, 'y': y})
# lmplot returns a FacetGrid; set the title on its single Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Monotonically Increasing')
```

Out[49]:

In [50]:

```
df.corr(method='spearman')
```

Out[50]:

In [51]:

```
# Plot decreasing relation
df = pd.DataFrame({'x': x2, 'y': y2})
# lmplot returns a FacetGrid; set the title on its single Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Monotonically Decreasing')
```

Out[51]:

In [52]:

```
df.corr(method='spearman')
```

Out[52]:

In [55]:

```
# Plot a relation that is not monotonic
df = pd.DataFrame({'x': x, 'y': y3})
# lmplot returns a FacetGrid; set the title on its single Axes
g = sns.lmplot(x='x', y='y', data=df)
g.ax.set_title('Not Monotonic')
```

Out[55]:

In [56]:

```
df.corr(method='spearman')
```

Out[56]:
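A useful way to read the Spearman value is as the Pearson correlation of the ranks rather than of the raw values. The sketch below checks this on the monotonically increasing curve from above and contrasts it with the plain Pearson value; the variable names `pearson_r`, `spearman_r` and `rank_pearson_r` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 100)
df = pd.DataFrame({'x': x, 'y': (1 / 4) * (x ** 2)})

# Pearson on the raw values: penalized because the curve is not a straight line
pearson_r = df.corr(method='pearson').loc['x', 'y']

# Spearman on the raw values: only the ordering matters
spearman_r = df.corr(method='spearman').loc['x', 'y']

# Pearson on the ranks should reproduce the Spearman value
rank_pearson_r = df.rank().corr(method='pearson').loc['x', 'y']

print(pearson_r, spearman_r, rank_pearson_r)
```

Because the curve always increases, the ranks of `x` and `y` line up exactly, so the Spearman value is at its maximum even though the Pearson value is lower.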

A point-biserial correlation measures the relationship between a binary variable and a non-binary (typically continuous) variable. It is computed the same way as the Pearson correlation, except that one of the variables is binary.

Let's create a data set in which people are more likely to be obese as they get older.

In [65]:

```
# Create some values
obese = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
ages = [10, 20, 30, 35, 40, 45, 50, 55, 60, 70]
df = pd.DataFrame({'obese': obese, 'age': ages})
```

In [66]:

```
df.corr()
```

Out[66]:

In [68]:

```
df.plot.scatter('age', 'obese')
```

Out[68]:
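The equivalence with Pearson can be checked against the textbook point-biserial formula, which uses the difference between the two group means, the population standard deviation, and the group sizes. This is a sketch on the same data; the names `mean_obese`, `mean_not_obese`, `r_pb` and `r_pearson` are introduced here for illustration.

```python
import numpy as np

obese = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
ages = np.array([10, 20, 30, 35, 40, 45, 50, 55, 60, 70], dtype=float)

# Split the non-binary variable by the binary one
mean_obese = ages[obese == 1].mean()
mean_not_obese = ages[obese == 0].mean()
n1 = (obese == 1).sum()
n0 = (obese == 0).sum()
n = len(obese)

# r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std (ddof=0)
r_pb = (mean_obese - mean_not_obese) / ages.std() * np.sqrt(n1 * n0 / n ** 2)

# The Pearson coefficient on the same data should match
r_pearson = np.corrcoef(obese, ages)[0, 1]
print(r_pb, r_pearson)
```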