In [1]:
from datascience import *
import numpy as np
In [2]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In this lecture, I am going to use more interactive plots (they look better), so I am using the plotly.express library. We won't test you on this, but it's good to know.

In [3]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Lecture 29¶

Predicting Child Heights¶

Recall that, long ago, in lecture 10 we built a function to predict child heights. We started with Galton's height dataset, which contained the full-grown heights of children and the heights of both of their parents. We then computed the average height of each child's parents.

The following is a simplified version of the data containing just the parents' average height and the child's height.

In [4]:
# Note: Child heights are the **adult** heights of children in a family
families = Table.read_table('family_heights.csv')
parent_avgs = (families.column('father') + families.column('mother'))/2
heights = Table().with_columns(
    'Parent Average', parent_avgs,
    'Child', families.column('child'),
)
heights
Out[4]:
Parent Average Child
72.75 73.2
72.75 69.2
72.75 69
72.75 69
71 73.5
71 72.5
71 65.5
71 65.5
69.5 71
69.5 68

... (924 rows omitted)

What was the relationship between the height of the full-grown child and the parents' average height?

In [5]:
heights.iscatter('Parent Average', 'Child')

The Nearest Neighbor Predictions¶

Could we use this data to help us predict the full-grown height of a newborn child given the parents' heights?

In lecture 10, we actually developed a highly sophisticated process for predicting the height of a child given the average height of both their parents. We looked at children of parents with similar average heights in our data and then took the average height of those nearby children.

In [6]:
def nearest_neighbor_predictor(parent_average, window=0.5):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    similar_child_heights = (
        heights
            .where("Parent Average", are.between(lower_bound, upper_bound))
            .column("Child")
    )
    return np.mean(similar_child_heights)
In [7]:
my_height = 5*12 + 11 # 5 ft 11 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches
our_average = (my_height + spouse_height) / 2.0
our_average
Out[7]:
69.0
In [8]:
nearest_neighbor_predictor(our_average)
Out[8]:
67.988059701492531
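
We can also get a feel for how sensitive the prediction is to the window size. This is a quick exploration of our own (the 0.25 and 2.0 values are arbitrary; a very narrow window may match few or no families):

nearest_neighbor_predictor(our_average, window=0.25), nearest_neighbor_predictor(our_average, window=2.0)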

In pictures, this nearest neighbor predictor looks like this:

In [9]:
fig = px.scatter(x=heights.column('Parent Average'), y=heights.column('Child'))
fig.add_vline(our_average - 0.5)
fig.add_vline(our_average + 0.5)
fig.add_scatter(x=[our_average], y=[nearest_neighbor_predictor(our_average)], 
                name="Prediction", marker_size=10)

To get a sense of how well our predictor works, we can apply it to each of the records in our dataset. Of course, we already know the height of the child for each record, but this gives us a simple way to evaluate our predictions when we know the answer.

In [10]:
heights_with_predictions = heights.with_columns(
    'Prediction', heights.apply(nearest_neighbor_predictor, 'Parent Average'))
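
One simple way to quantify how good these predictions are (our own rough summary, not part of the original lecture) is the mean absolute prediction error:

errors = heights_with_predictions.column('Child') - heights_with_predictions.column('Prediction')
np.mean(np.abs(errors))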
In [11]:
heights_with_predictions.iscatter('Parent Average')

The yellow line above is not actually a line but a curve. It is a fairly advanced model capable of capturing complex non-linear relationships. However, for many activities in data science we will be interested in a simple line that approximates the yellow curve. In this and the next few lectures we will build an intuition for the properties of this line and its derivation.

We will start with a mathematical description of the linear association between two variables: a numerical measure of how closely two variables follow a line.









Association¶

We already saw one example of an association between two variables:

In [12]:
heights_with_predictions.iscatter('Parent Average')

Let's look at another dataset consisting of hybrid cars. This dataset contains the vehicle model, the year it was released, the manufacturer's suggested retail price (msrp), the acceleration (in km/h/s, so bigger is better), the fuel efficiency (mpg), and the type of car (class).

In [13]:
hybrid = Table.read_table('hybrid.csv')
hybrid.show(5)
vehicle year msrp acceleration mpg class
Prius (1st Gen) 1997 24509.7 7.46 41.26 Compact
Tino 2000 35355 8.2 54.1 Compact
Prius (2nd Gen) 2000 26832.2 7.97 45.23 Compact
Insight 2000 18936.4 9.52 53 Two Seater
Civic (1st Gen) 2001 25833.4 7.04 47.04 Compact

... (148 rows omitted)

There are some expensive hybrids...

In [14]:
hybrid.sort('msrp', descending=True)
Out[14]:
vehicle year msrp acceleration mpg class
Lexus LS600h/hL 2007 118544 17.54 21 Midsize
ActiveHybrid 7 2010 104300 20.41 22.11 Large
ActiveHybrid 7i 2011 102606 18.18 20 Midsize
ActiveHybrid X6 2009 97237.9 17.96 18.82 SUV
S400 Long 2009 96208.9 13.89 26.34 Large
Panamera S 2013 96150 18.52 25 Large
Panamera S 2012 95283.9 17.54 25 Large
S400 2013 92350 13.89 21 Large
S400 2010 88212.8 12.99 21 Large
ActiveHybrid 7L 2013 84300 18.18 25 Large

... (143 rows omitted)

The first step in studying an association is to visualize the data.

In [15]:
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="mpg", 
           hover_name="vehicle", 
           color="class")
In [16]:
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="acceleration", 
           hover_name="vehicle", 
           color="class")

We could even consider plotting MPG, acceleration, and price all at once.

In [17]:
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="acceleration", 
           size="mpg",
           hover_name="vehicle", 
           color="class")
In [18]:
px.scatter(hybrid.to_df(), 
           x="mpg", 
           y="acceleration", 
           size="msrp",
           hover_name="vehicle", 
           color="class")

What kinds of associations do we observe?









Correlation¶

Correlation is a measure of the linear relationship between two variables. Before we show you how to compute correlation, let's build an intuition for what it means. To do that, we will use the following helper functions to generate data with different correlation values.

These are helper functions that generate and plot synthetic data with a given $r$ value. You are not expected to understand how they work, as they are well beyond the scope of this class.

In [19]:
def make_correlated_data(r, n=500):
    "Generate a table with columns x and y with a correlation of approximately r."
    x = np.random.normal(0, 1, n)
    z = np.random.normal(0, 1, n)
    # Mixing x with independent noise z in these proportions gives corr(x, y) ≈ r.
    y = r*x + (np.sqrt(1-r**2))*z
    return Table().with_columns("x", x, "y", y)

def r_scatter(r, n=500, subplot=False):
    "Generate a scatter plot with a correlation approximately r"
    data = make_correlated_data(r, n)
    plt = go.Scatter(x=data.column("x"), y=data.column("y"), 
                     name=str(r), mode="markers")
    if subplot:
        return plt
    else: 
        return go.Figure().add_trace(plt)
In [20]:
r_scatter(0.8)
In [21]:
figs = make_subplots(2,3)
n=200
figs.add_trace(r_scatter(0.2,  n, subplot=True), 1, 1)
figs.add_trace(r_scatter(0.5,  n, subplot=True), 1, 2)
figs.add_trace(r_scatter(0.8,  n, subplot=True), 1, 3)
figs.add_trace(r_scatter(-0.2, n, subplot=True), 2, 1)
figs.add_trace(r_scatter(-0.5, n, subplot=True), 2, 2)
figs.add_trace(r_scatter(-0.8, n, subplot=True), 2, 3)




Computing the Correlation¶

To derive the correlation, we start by converting our data to standard units.

Recall in previous lectures we introduced a function to transform our data into standard units.

In [22]:
def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)
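
As a quick sanity check (our own aside, not part of the original lecture), an array in standard units should have mean 0 and standard deviation 1:

mpg_su = standard_units(hybrid.column('mpg'))
np.mean(mpg_su), np.std(mpg_su)  # approximately 0 and exactly 1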

Let's add standard unit (SU) versions of the mpg, msrp, and acceleration columns to our hybrid table.

In [23]:
hybrid = hybrid.with_columns(
    "mpg (SU)", standard_units(hybrid.column('mpg')),
    "msrp (SU)", standard_units(hybrid.column('msrp')),
    "acceleration (SU)", standard_units(hybrid.column('acceleration')),
)
hybrid.show(5)
vehicle year msrp acceleration mpg class mpg (SU) msrp (SU) acceleration (SU)
Prius (1st Gen) 1997 24509.7 7.46 41.26 Compact 0.59091 -0.69363 -1.53501
Tino 2000 35355 8.2 54.1 Compact 1.76495 -0.18568 -1.2825
Prius (2nd Gen) 2000 26832.2 7.97 45.23 Compact 0.953911 -0.584852 -1.36098
Insight 2000 18936.4 9.52 53 Two Seater 1.66437 -0.954663 -0.832081
Civic (1st Gen) 2001 25833.4 7.04 47.04 Compact 1.11941 -0.631636 -1.67832

... (148 rows omitted)

How does this change the plots?

In [24]:
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="acceleration", 
           hover_name="vehicle", 
           color="class")
In [25]:
px.scatter(hybrid.to_df(), 
           x="msrp (SU)", 
           y="acceleration (SU)", 
           hover_name="vehicle", 
           color="class")

I could not use the standard-unit values for the marker size because marker sizes must be positive, and standard units can be negative.
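
If we really wanted marker sizes based on the standardized mpg, one workaround (a sketch of our own, not from the lecture) is to shift the values so the smallest is positive:

mpg_su = hybrid.column('mpg (SU)')
px.scatter(hybrid.to_df(), 
           x="msrp (SU)", 
           y="acceleration (SU)", 
           size=mpg_su - min(mpg_su) + 1,  # shift so every marker size is at least 1
           hover_name="vehicle", 
           color="class")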




The correlation is the average of the product of the standard units of each variable.

\begin{align}
r & = \frac{1}{n}\sum_{i=1}^n \left( \frac{x_i - \text{Mean}(x)}{\text{Stdev}(x)} \right) \times \left( \frac{y_i - \text{Mean}(y)}{\text{Stdev}(y)} \right) \\
& = \frac{1}{n} \sum_{i=1}^n \text{StandardUnits}(x_i) \times \text{StandardUnits}(y_i) \\
& = \text{Mean}\left(\text{StandardUnits}(x) \times \text{StandardUnits}(y)\right)
\end{align}
In [26]:
np.mean(hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)"))
Out[26]:
0.69557789969139783
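
Equivalently, we can compute $r$ directly from the raw columns using the first form of the formula above. This is our own quick check that the two forms agree:

x = hybrid.column('acceleration')
y = hybrid.column('msrp')
np.mean((x - np.mean(x)) / np.std(x) * (y - np.mean(y)) / np.std(y))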

A positive correlation close to 1 would mean that when acceleration is above the mean, msrp tends to be above the mean as well, so the product of the two standard-unit values is usually positive. Looking at the histogram of the products we see:

In [27]:
Table().with_column(
    "Product", hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)")
).ihist("Product", bins=20)

Defining the Correlation Function¶

Let's define a function that computes the correlation between two columns in a table.

In [28]:
def correlation(t, x, y):
    """t is a table; x and y are column labels"""
    x_in_su = standard_units(t.column(x))
    y_in_su = standard_units(t.column(y))
    return np.mean(x_in_su * y_in_su)
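
As a quick sanity check (our own aside), the correlation of any column with itself should be exactly 1:

correlation(hybrid, 'mpg', 'mpg')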



Can you guess the value of the correlation for each of the following relationships?

In [29]:
fig = make_subplots(1,3)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("mpg"), 
                mode="markers", row=1, col=1)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("acceleration"), 
                mode="markers", row=1, col=2)
fig.add_scatter(x=hybrid.column("mpg"), y=hybrid.column("acceleration"), 
                mode="markers", row=1, col=3)
fig.update_xaxes(title_text="msrp", row=1, col=1)
fig.update_yaxes(title_text="mpg", row=1, col=1)
fig.update_xaxes(title_text="msrp", row=1, col=2)
fig.update_yaxes(title_text="acceleration", row=1, col=2)
fig.update_xaxes(title_text="mpg", row=1, col=3)
fig.update_yaxes(title_text="acceleration", row=1, col=3)
fig.update_layout(showlegend=False)
In [30]:
correlation(hybrid, "msrp", "mpg")
Out[30]:
-0.53182636336837863
In [31]:
correlation(hybrid, "msrp", "acceleration")
Out[31]:
0.69557789969139783
In [32]:
correlation(hybrid, "mpg", "acceleration")
Out[32]:
-0.5060703843771186




Switching Axes¶

What happens if we swap the axes?

In [33]:
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="acceleration", 
           hover_name="vehicle", 
           color="class")
In [34]:
px.scatter(hybrid.to_df(), 
           x="acceleration", 
           y="msrp", 
           hover_name="vehicle", 
           color="class")
In [35]:
correlation(hybrid, "msrp", "acceleration")
Out[35]:
0.69557789969139783
In [36]:
correlation(hybrid, "acceleration", "msrp")
Out[36]:
0.69557789969139783
Switching the axes doesn't affect the correlation. It is a symmetric function: multiplication is commutative, so the mean of the products of standard units is the same regardless of which variable we call $x$ and which we call $y$.









Care when Interpreting Correlation¶

When computing a correlation, it is important to always visualize your data first and then consider each of the following issues.




Correlation does Not Imply Causation¶

We have covered this one extensively at this point.




Nonlinearity¶

Low correlation does not imply the absence of a relationship. Correlation measures linear relationships; data with a strong non-linear relationship may have very low correlation.

In [37]:
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns(
        'x', new_x,
        'y', new_x**2
    )
nonlinear.iscatter('x', 'y')

There is clearly a relationship in this data. Given the value of $x$, you can easily predict the value of $y$. What is the correlation?

In [38]:
correlation(nonlinear, 'x', 'y')
Out[38]:
0.0

The correlation is exactly 0: the data are symmetric about $x = 0$, so each point's product of standard units is cancelled by that of its mirror image. As a quick aside, how would our nearest neighbor predictor work on this non-linear data?

In [39]:
def nn_predictor(x):
    return np.mean(nonlinear.where("x", are.between(x-0.51, x+0.51)).column("y"))
In [40]:
(
    nonlinear.with_column("Prediction", nonlinear.apply(nn_predictor, "x"))
            .iscatter("x")
)




Outliers¶

Outliers can have a significant effect on correlation. As the following example shows, a single outlier can drag a perfect correlation of 1 all the way down to 0.

In [41]:
line = Table().with_columns(
        'x', make_array(1, 2, 3, 4),
        'y', make_array(1, 2, 3, 4)
    )
line.iscatter('x', 'y')
In [42]:
correlation(line, 'x', 'y')
Out[42]:
1.0
In [43]:
outlier = Table().with_columns(
        'x', make_array(1, 2, 3, 4, 5),
        'y', make_array(1, 2, 3, 4, 0)
    )
outlier.iscatter('x', 'y')
In [44]:
correlation(outlier, 'x', 'y')
Out[44]:
0.0

Ecological Correlations¶

The correlation between aggregated variables (e.g., after grouping) may be much higher than the correlation between the underlying variables.
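
Before turning to real data, here is a small synthetic sketch of our own (all of the numbers below are made up) showing why aggregation inflates correlation: each individual's x and y share a group-level signal plus independent noise, and averaging within groups washes the noise out.

# Synthetic illustration: 20 hypothetical groups of 50 individuals each.
np.random.seed(42)
group_signal = np.random.normal(500, 40, 20)
x_individual = np.repeat(group_signal, 50) + np.random.normal(0, 60, 1000)
y_individual = np.repeat(group_signal, 50) + np.random.normal(0, 60, 1000)
individuals = Table().with_columns(
    'x', x_individual,
    'y', y_individual,
    'group', np.repeat(np.arange(20), 50))
by_group = individuals.group('group', np.mean)
# The group-level correlation is far higher than the individual-level one.
correlation(individuals, 'x', 'y'), correlation(by_group, 'x mean', 'y mean')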

In [45]:
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014
Out[45]:
State Participation Rate Critical Reading Math Writing Combined
Alabama 6.7 547 538 532 1617
Alaska 54.2 507 503 475 1485
Arizona 36.4 522 525 500 1547
Arkansas 4.2 573 571 554 1698
California 60.3 498 510 496 1504
Colorado 14.3 582 586 567 1735
Connecticut 88.4 507 510 508 1525
Delaware 100 456 459 444 1359
District of Columbia 100 440 438 431 1309
Florida 72.2 491 485 472 1448

... (41 rows omitted)

In [46]:
sat2014.iscatter('Critical Reading', 'Math')
In [47]:
correlation(sat2014, 'Critical Reading', 'Math')
Out[47]:
0.98475584110674341

That is a very strong correlation. However, each data point here is an average over an entire state. The individual students behind each point would show far more variability in their scores, so the correlation between individual students' reading and math scores would be substantially weaker.





Bonus: Understanding the SAT data¶

While we have the data loaded, does anyone have a guess as to which dots correspond to which states?

In [48]:
px.scatter(sat2014.to_df(), 
           x = "Critical Reading",
           y = "Math",
           hover_name = "State",
           size = "Participation Rate")