In [1]:
from datascience import *
import numpy as np
In [2]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In this lecture, I am going to use more interactive plots (they look better), so I am using the plotly.express library. We won't test you on this, but it's good to know.

In [3]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Lecture 31

In this lecture, we will explore the use of optimization to find the "best" model and derive least-squares regression.

Review From Last Lecture

In the last lecture, we developed the equations for the slope and intercept of the regression line in terms of the correlation coefficient $r$.

Standard Units

$$ \text{StandardUnits}(x) = \frac{x - \text{Mean}(x)}{\text{Stdev}(x)} $$
In [4]:
def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)
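
As a quick illustrative check (this cell is not from the original notebook), converting any array to standard units should give a mean of 0 and a standard deviation of 1:

In [ ]:
# Illustrative check: standard units always have mean 0 and standard deviation 1.
values = make_array(2, 4, 6, 8, 10)
su = standard_units(values)
np.mean(su), np.std(su)   # approximately (0.0, 1.0)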

Correlation

$$ \begin{align} r & = \text{Mean}\left(\text{StandardUnits}(x) * \text{StandardUnits}(y)\right)\\ & = \frac{1}{n} \sum_{i=1}^n \text{StandardUnits}(x_i) * \text{StandardUnits}(y_i)\\ & = \frac{1}{n}\sum_{i=1}^n \left( \frac{x_i - \text{Mean}(x)}{\text{Stdev}(x)} \right) * \left( \frac{y_i - \text{Mean}(y)}{\text{Stdev}(y)} \right) \\ \end{align} $$
In [5]:
def correlation(t, x, y):
    """Computes the correlation between columns x and y"""
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.mean(x_su * y_su)
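
To see correlation in action, here is a small illustrative table (not from the lecture) in which $y = 2x + 1$ exactly, so the correlation should come out to 1:

In [ ]:
# Illustrative example: a perfectly linear, increasing relationship has r = 1.
example = Table().with_columns(
    'x', make_array(1, 2, 3, 4, 5),
    'y', make_array(3, 5, 7, 9, 11)
)
correlation(example, 'x', 'y')   # 1.0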

Slope and Intercept

$$ \begin{align} \text{slope} &= r * \frac{\text{Stdev}(y)}{\text{Stdev}(x)}\\ \text{intercept} & = \text{Mean}(y) - \text{slope} * \text{Mean}(x) \end{align} $$
In [6]:
def slope(t, x, y):
    """Computes the slope of the regression line"""
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd
In [7]:
def intercept(t, x, y):
    """Computes the intercept of the regression line"""
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y)*x_mean
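
Reusing the illustrative example table from above (where $y = 2x + 1$ exactly), these formulas should recover a slope of 2 and an intercept of 1:

In [ ]:
# Illustrative check: with r = 1, the regression line is the line through the points.
slope(example, 'x', 'y'), intercept(example, 'x', 'y')   # (2.0, 1.0)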

Linear Prediction

$$ y_\text{predicted} = \text{slope} * x + \text{intercept} $$
In [8]:
def predict_linear(t, x, y):
    """Return an array of the regressions estimates at all the x values"""
    pred_y = slope(t, x, y) * t.column(x) + intercept(t, x, y)
    return pred_y
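
On the same illustrative example table, the predictions reproduce the original $y$ values, since those points already lie exactly on a line:

In [ ]:
# Illustrative check: predictions on perfectly linear data match y exactly.
predict_linear(example, 'x', 'y')   # array([ 3.,  5.,  7.,  9., 11.])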

Making Predictions with Linear Regression

We can now compute predictions, but how good are they? How do we know that we have a good linear fit? To study this, we will consider a new dataset.

In [9]:
demographics = Table.read_table('district_demographics2016.csv')
demographics.show(5)
State   | District                                           | Median Income | Percent voting for Clinton | College%
Alabama | Congressional District 1 (115th Congress), Alabama | 47083         | 34.1                       | 24
Alabama | Congressional District 2 (115th Congress), Alabama | 42035         | 33                         | 21.8
Alabama | Congressional District 3 (115th Congress), Alabama | 46544         | 32.3                       | 22.8
Alabama | Congressional District 4 (115th Congress), Alabama | 41110         | 17.4                       | 17
Alabama | Congressional District 5 (115th Congress), Alabama | 51690         | 31.3                       | 30.3

... (430 rows omitted)

In [10]:
px.scatter(demographics.to_df(), 
           x="College%", 
           y="Median Income",
           color="State")