from datascience import *
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In this lecture, I am going to use more interactive plots (they look better), so I am using the plotly.express library. We won't test you on this, but it's good to know.
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
Recall that, long ago, in lecture 10 we built a function to predict child heights. We started with Galton's height dataset, which contains the full-grown height of each child and the heights of both of their parents. We then computed the average height of each child's parents.
The following is a simplified version of the data, containing just the parents' average height and the child's height.
# Note: Child heights are the **adult** heights of children in a family
families = Table.read_table('family_heights.csv')
parent_avgs = (families.column('father') + families.column('mother'))/2
heights = Table().with_columns(
'Parent Average', parent_avgs,
'Child', families.column('child'),
)
heights
Parent Average | Child |
---|---|
72.75 | 73.2 |
72.75 | 69.2 |
72.75 | 69 |
72.75 | 69 |
71 | 73.5 |
71 | 72.5 |
71 | 65.5 |
71 | 65.5 |
69.5 | 71 |
69.5 | 68 |
... (924 rows omitted)
What was the relationship between the height of the full-grown child and the heights of the parents?
heights.iscatter('Parent Average', 'Child')
Could we use this data to help us predict the height of a newborn child given the parent's height?
In lecture 10, we actually developed a highly sophisticated process for predicting the height of a child given the average height of both their parents. We looked at children of parents with similar heights in our data and then took the average height of those nearby children.
def nearest_neighbor_predictor(parent_average, window=0.5):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    similar_child_heights = (
        heights
        .where("Parent Average", are.between(lower_bound, upper_bound))
        .column("Child")
    )
    return np.mean(similar_child_heights)
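The same windowed-average idea can be sketched with plain NumPy, independent of the datascience Table API. The arrays below are made-up illustrative values, not Galton's data; also note that `are.between` uses a half-open interval, while this sketch uses a closed one.

```python
import numpy as np

def nn_predict(parent_avgs, child_heights, query, window=0.5):
    """Average the child heights whose parent average is within +/- window of query."""
    mask = np.abs(parent_avgs - query) <= window
    return np.mean(child_heights[mask])

parents = np.array([68.0, 68.7, 69.0, 69.3, 71.0])   # made-up parent averages
children = np.array([67.0, 68.5, 69.5, 70.0, 72.0])  # made-up child heights
print(nn_predict(parents, children, 69.0))  # mean of 68.5, 69.5, 70.0
```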
my_height = 5*12 + 11 # 5 ft 11 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches
our_average = (my_height + spouse_height) / 2.0
our_average
69.0
nearest_neighbor_predictor(our_average)
67.988059701492531
In pictures, this nearest neighbor predictor looks like this:
fig = px.scatter(x=heights.column('Parent Average'), y=heights.column('Child'))
fig.add_vline(our_average - 0.5)
fig.add_vline(our_average + 0.5)
fig.add_scatter(x=[our_average], y=[nearest_neighbor_predictor(our_average)],
name="Prediction", marker_size=10)
To get a sense as to how well our predictor works, we can apply it to each of the records in our dataset. Of course, we already know the height of the children for each of the records but this gives us a simple way to evaluate our predictions when we know the answer.
heights_with_predictions = heights.with_columns(
'Prediction', heights.apply(nearest_neighbor_predictor, 'Parent Average'))
heights_with_predictions.iscatter('Parent Average')
The yellow line above is not actually a line but a curve. It is a fairly advanced model capable of capturing complex non-linear relationships. However, for many activities in data science we will be interested in a simple line that approximates the yellow curve. In this and the next few lectures we will build an intuition for the properties of this line and its derivation.
We will start with a mathematical description of the linear association between two variables: a numerical measure of how closely two variables follow a line.
We already saw one example of an association between two variables:
heights_with_predictions.iscatter('Parent Average')
Let's look at another dataset consisting of hybrid cars. This dataset contains the vehicle model, the year it was released, the manufacturer's suggested retail price (msrp), the acceleration (in km/h/s, so bigger is better), the fuel efficiency (mpg), and the type of car (class).
hybrid = Table.read_table('hybrid.csv')
hybrid.show(5)
vehicle | year | msrp | acceleration | mpg | class |
---|---|---|---|---|---|
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact |
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact |
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact |
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater |
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact |
... (148 rows omitted)
There are some expensive hybrids...
hybrid.sort('msrp', descending=True)
vehicle | year | msrp | acceleration | mpg | class |
---|---|---|---|---|---|
Lexus LS600h/hL | 2007 | 118544 | 17.54 | 21 | Midsize |
ActiveHybrid 7 | 2010 | 104300 | 20.41 | 22.11 | Large |
ActiveHybrid 7i | 2011 | 102606 | 18.18 | 20 | Midsize |
ActiveHybrid X6 | 2009 | 97237.9 | 17.96 | 18.82 | SUV |
S400 Long | 2009 | 96208.9 | 13.89 | 26.34 | Large |
Panamera S | 2013 | 96150 | 18.52 | 25 | Large |
Panamera S | 2012 | 95283.9 | 17.54 | 25 | Large |
S400 | 2013 | 92350 | 13.89 | 21 | Large |
S400 | 2010 | 88212.8 | 12.99 | 21 | Large |
ActiveHybrid 7L | 2013 | 84300 | 18.18 | 25 | Large |
... (143 rows omitted)
The first step in studying an association is to visualize the data.
px.scatter(hybrid.to_df(),
x="msrp",
y="mpg",
hover_name="vehicle",
color="class")
px.scatter(hybrid.to_df(),
x="msrp",
y="acceleration",
hover_name="vehicle",
color="class")
We could even consider plotting MPG, acceleration, and price all at once.
px.scatter(hybrid.to_df(),
x="msrp",
y="acceleration",
size="mpg",
hover_name="vehicle",
color="class")
px.scatter(hybrid.to_df(),
x="mpg",
y="acceleration",
size="msrp",
hover_name="vehicle",
color="class")
What kinds of associations do we observe?
Correlation is a measure of the linear relationship between two variables. Before we show you how to compute correlation, let's build an intuition for what it means. To do that we will use the following helper function to generate data with different correlation values.
This is a helper function that generates and plots synthetic data with a given $r$ value. You are not expected to understand how this function works as it is well beyond the scope of this class.
def make_correlated_data(r, n=500):
    "Generate a table with columns x and y with a correlation of approximately r"
    x = np.random.normal(0, 1, n)
    z = np.random.normal(0, 1, n)
    y = r*x + np.sqrt(1 - r**2)*z
    return Table().with_columns("x", x, "y", y)
def r_scatter(r, n=500, subplot=False):
    "Generate a scatter plot with a correlation approximately r"
    data = make_correlated_data(r, n)
    plt = go.Scatter(x=data.column("x"), y=data.column("y"),
                     name=str(r), mode="markers")
    if subplot:
        return plt
    else:
        return go.Figure().add_trace(plt)
r_scatter(0.8)
figs = make_subplots(2,3)
n=200
figs.add_trace(r_scatter(0.2, n, subplot=True), 1, 1)
figs.add_trace(r_scatter(0.5, n, subplot=True), 1, 2)
figs.add_trace(r_scatter(0.8, n, subplot=True), 1, 3)
figs.add_trace(r_scatter(-0.2, n, subplot=True), 2, 1)
figs.add_trace(r_scatter(-0.5, n, subplot=True), 2, 2)
figs.add_trace(r_scatter(-0.8, n, subplot=True), 2, 3)
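If you are curious, you can check that the simulated data really does come out with correlation close to $r$. The sketch below redoes the generator with plain NumPy arrays and uses NumPy's built-in np.corrcoef; the seed and sample size are arbitrary choices, not from the lecture.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, for reproducibility only

def correlated_xy(r, n=100_000):
    """Generate arrays x, y whose correlation is approximately r."""
    x = np.random.normal(0, 1, n)
    z = np.random.normal(0, 1, n)
    y = r*x + np.sqrt(1 - r**2)*z
    return x, y

x, y = correlated_xy(0.8)
print(np.corrcoef(x, y)[0, 1])  # close to 0.8, up to sampling noise
```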
To derive the correlation, we start by converting our data to standard units.
Recall in previous lectures we introduced a function to transform our data into standard units.
def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)
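A quick sanity check (an addition, not part of the original lecture): any array converted to standard units has mean 0 and standard deviation 1, which is what makes correlation unit-free.

```python
import numpy as np

def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)

heights_example = np.array([60.0, 65.0, 70.0, 75.0])  # made-up values
su = standard_units(heights_example)
print(np.round(np.mean(su), 12), np.round(np.std(su), 12))  # 0.0 1.0
```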
Let's add standard unit (SU) versions of the mpg, msrp, and acceleration columns to our hybrid table.
hybrid = hybrid.with_columns(
"mpg (SU)", standard_units(hybrid.column('mpg')),
"msrp (SU)", standard_units(hybrid.column('msrp')),
"acceleration (SU)", standard_units(hybrid.column('acceleration')),
)
hybrid.show(5)
vehicle | year | msrp | acceleration | mpg | class | mpg (SU) | msrp (SU) | acceleration (SU) |
---|---|---|---|---|---|---|---|---|
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact | 0.59091 | -0.69363 | -1.53501 |
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact | 1.76495 | -0.18568 | -1.2825 |
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact | 0.953911 | -0.584852 | -1.36098 |
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater | 1.66437 | -0.954663 | -0.832081 |
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact | 1.11941 | -0.631636 | -1.67832 |
... (148 rows omitted)
How does this change the plots?
px.scatter(hybrid.to_df(),
x="msrp",
y="acceleration",
hover_name="vehicle",
color="class")
px.scatter(hybrid.to_df(),
x="msrp (SU)",
y="acceleration (SU)",
hover_name="vehicle",
color="class")
I could not scale the marker size by a standard-units column because marker sizes must be non-negative.
The correlation is the average of the product of the standard units of the two variables.
\begin{align} r & = \frac{1}{n}\sum_{i=1}^n \left( \frac{x_i - \text{Mean}(x)}{\text{Stdev}(x)} \right) \times \left( \frac{y_i - \text{Mean}(y)}{\text{Stdev}(y)} \right) \\ & = \frac{1}{n} \sum_{i=1}^n \text{StandardUnits}(x_i) \times \text{StandardUnits}(y_i)\\ & = \text{Mean}\left(\text{StandardUnits}(x) \times \text{StandardUnits}(y)\right) \end{align}
np.mean(hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)"))
0.69557789969139783
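NumPy ships its own correlation routine, np.corrcoef, which agrees with the mean-of-products-of-standard-units definition above. Since the hybrid columns aren't reproduced here, this sketch uses made-up arrays:

```python
import numpy as np

def standard_units(x):
    return (x - np.average(x)) / np.std(x)

# Made-up example arrays, not the hybrid data
accel = np.array([7.5, 8.2, 13.9, 17.5, 20.4])
msrp = np.array([24.5, 35.4, 96.2, 118.5, 104.3])

r = np.mean(standard_units(accel) * standard_units(msrp))
print(np.isclose(r, np.corrcoef(accel, msrp)[0, 1]))  # True
```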
A positive correlation close to 1 would mean that when acceleration is larger than its mean, msrp should also be larger than its mean. Looking at the histogram of the products we see:
Table().with_column(
"Product", hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)")
).ihist("Product", bins=20)
Let's define a function that computes the correlation between two columns in a table.
def correlation(t, x, y):
    """t is a table; x and y are column labels"""
    x_in_su = standard_units(t.column(x))
    y_in_su = standard_units(t.column(y))
    return np.mean(x_in_su * y_in_su)
Can you guess the value of the correlation for each of the following relationships:
fig = make_subplots(1,3)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("mpg"),
mode="markers", row=1, col=1)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("acceleration"),
mode="markers", row=1, col=2)
fig.add_scatter(x=hybrid.column("mpg"), y=hybrid.column("acceleration"),
mode="markers", row=1, col=3)
fig.update_xaxes(title_text="msrp", row=1, col=1)
fig.update_yaxes(title_text="mpg", row=1, col=1)
fig.update_xaxes(title_text="msrp", row=1, col=2)
fig.update_yaxes(title_text="acceleration", row=1, col=2)
fig.update_xaxes(title_text="mpg", row=1, col=3)
fig.update_yaxes(title_text="acceleration", row=1, col=3)
fig.update_layout(showlegend=False)
correlation(hybrid, "msrp", "mpg")
-0.53182636336837863
correlation(hybrid, "msrp", "acceleration")
0.69557789969139783
correlation(hybrid, "mpg", "acceleration")
-0.5060703843771186
px.scatter(hybrid.to_df(),
x="msrp",
y="acceleration",
hover_name="vehicle",
color="class")
px.scatter(hybrid.to_df(),
x="acceleration",
y="msrp",
hover_name="vehicle",
color="class")
correlation(hybrid, "msrp", "acceleration")
0.69557789969139783
correlation(hybrid, "acceleration", "msrp")
0.69557789969139783
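The two results above agree because the formula is symmetric in $x$ and $y$. For the same reason, correlation is unchanged by a change of units (any positive linear rescaling of either variable), since standard units subtract the mean and divide by the standard deviation. A sketch with made-up numbers:

```python
import numpy as np

def standard_units(x):
    return (x - np.average(x)) / np.std(x)

def correlation_arrays(x, y):
    """Correlation of two arrays: mean of products of standard units."""
    return np.mean(standard_units(x) * standard_units(y))

accel_kmhs = np.array([7.5, 8.2, 13.9, 17.5, 20.4])                   # made-up values
msrp_dollars = np.array([24510.0, 35355.0, 96209.0, 118544.0, 104300.0])

r_original = correlation_arrays(accel_kmhs, msrp_dollars)
# Convert to mph/s and thousands of dollars: the correlation is identical
r_rescaled = correlation_arrays(accel_kmhs * 0.6214, msrp_dollars / 1000)
print(np.isclose(r_original, r_rescaled))  # True
```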
When computing correlation, it is important to always visualize your data first and then consider each of the following issues.
Low correlation does not imply the absence of a relationship. Correlation measures linear relationships; data with a strong non-linear relationship may have very low correlation.
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns(
'x', new_x,
'y', new_x**2
)
nonlinear.iscatter('x', 'y')
There is clearly a relationship to this data. Given the value of $x$ you can easily predict the value of $y$. What is the correlation?
correlation(nonlinear, 'x', 'y')
0.0
As a quick aside, how would our nearest neighbor predictor work on this non-linear data?
def nn_predictor(x):
    return np.mean(nonlinear.where("x", are.between(x - 0.51, x + 0.51)).column("y"))
(
nonlinear.with_column("Prediction", nonlinear.apply(nn_predictor, "x"))
.iscatter("x")
)
line = Table().with_columns(
'x', make_array(1, 2, 3, 4),
'y', make_array(1, 2, 3, 4)
)
line.iscatter('x', 'y')
correlation(line, 'x', 'y')
1.0
Correlation is also very sensitive to outliers. A single point far from the rest of the data can drastically change the correlation:
outlier = Table().with_columns(
'x', make_array(1, 2, 3, 4, 5),
'y', make_array(1, 2, 3, 4, 0)
)
outlier.iscatter('x', 'y')
correlation(outlier, 'x', 'y')
0.0
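The exact zero is not a rounding artifact: with these five points, the mean of the products equals the product of the means, so the covariance (and hence the correlation) vanishes despite the obvious trend in four of the five points. Checking with plain NumPy:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 4, 0])
# Covariance is mean(xy) - mean(x)*mean(y); here it is exactly zero,
# so the correlation is zero despite the pattern in 4 of the 5 points.
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                       # 0.0
print(np.corrcoef(x, y)[0, 1])   # 0.0
```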
The correlation between aggregated variables (e.g., after grouping) may be much higher than the correlation between the underlying variables.
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014
State | Participation Rate | Critical Reading | Math | Writing | Combined |
---|---|---|---|---|---|
Alabama | 6.7 | 547 | 538 | 532 | 1617 |
Alaska | 54.2 | 507 | 503 | 475 | 1485 |
Arizona | 36.4 | 522 | 525 | 500 | 1547 |
Arkansas | 4.2 | 573 | 571 | 554 | 1698 |
California | 60.3 | 498 | 510 | 496 | 1504 |
Colorado | 14.3 | 582 | 586 | 567 | 1735 |
Connecticut | 88.4 | 507 | 510 | 508 | 1525 |
Delaware | 100 | 456 | 459 | 444 | 1359 |
District of Columbia | 100 | 440 | 438 | 431 | 1309 |
Florida | 72.2 | 491 | 485 | 472 | 1448 |
... (41 rows omitted)
sat2014.iscatter('Critical Reading', 'Math')
correlation(sat2014, 'Critical Reading', 'Math')
0.98475584110674341
That is a very strong correlation. However, each data point is an aggregate (a state average); the individual students within each state had far greater variability in their scores, so the correlation between individual reading and math scores would be much weaker.
While we have the data loaded: does anyone have a guess which dots correspond to which states?
px.scatter(sat2014.to_df(),
x = "Critical Reading",
y = "Math",
hover_name = "State",
size = "Participation Rate")