from datascience import *
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In this lecture, I am going to use interactive plots, which look better than static ones, so I am using the plotly.express library. We won't test you on plotly, but it's good to know.
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
In this lecture, we will explore the use of optimization to find the "best" model and derive least-squares regression.
In the last lecture, we developed equations for the slope and intercept of the regression line using the correlation coefficient $r$.
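For reference, those equations can be written as follows. The notation is an assumption made here for illustration: $\text{SD}_x$ and $\text{SD}_y$ denote the standard deviations of $x$ and $y$, and $\bar{x}$ and $\bar{y}$ denote their means.

$$
\text{slope} = r \cdot \frac{\text{SD}_y}{\text{SD}_x}
\qquad
\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}
$$

The correlation coefficient $r$ itself is the mean of the products of $x$ and $y$ after both are converted to standard units.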
def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)
def correlation(t, x, y):
    """Computes the correlation between columns x and y"""
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.mean(x_su * y_su)
def slope(t, x, y):
    """Computes the slope of the regression line"""
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd
def intercept(t, x, y):
    """Computes the intercept of the regression line"""
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y) * x_mean
def predict_linear(t, x, y):
    """Returns an array of the regression estimates at all the x values"""
    pred_y = slope(t, x, y) * t.column(x) + intercept(t, x, y)
    return pred_y
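As a quick sanity check, here is a minimal sketch that applies the same formulas to plain NumPy arrays and compares the result to NumPy's built-in least-squares line fit. The data are made up for illustration, and the helper names `slope_from_arrays` and `intercept_from_arrays` are hypothetical (they are not part of the lecture code, which works on Table columns).

```python
import numpy as np

def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)

def slope_from_arrays(x, y):
    """Hypothetical helper: regression slope computed from two arrays"""
    r = np.mean(standard_units(x) * standard_units(y))
    return r * np.std(y) / np.std(x)

def intercept_from_arrays(x, y):
    """Hypothetical helper: regression intercept computed from two arrays"""
    return np.mean(y) - slope_from_arrays(x, y) * np.mean(x)

# Made-up data, used only to illustrate the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m = slope_from_arrays(x, y)
b = intercept_from_arrays(x, y)

# np.polyfit with degree 1 returns the least-squares [slope, intercept]
ls_slope, ls_intercept = np.polyfit(x, y, 1)

print("correlation-based:", m, b)
print("least squares:    ", ls_slope, ls_intercept)
```

The two lines should agree up to floating-point error, which hints at why the correlation-based line is also called the least-squares line.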
We can now compute predictions, but how good are they? How do we know that we have a good linear fit? To study this we will consider a new dataset.
demographics = Table.read_table('district_demographics2016.csv')
demographics.show(5)
State | District | Median Income | Percent voting for Clinton | College% |
---|---|---|---|---|
Alabama | Congressional District 1 (115th Congress), Alabama | 47083 | 34.1 | 24 |
Alabama | Congressional District 2 (115th Congress), Alabama | 42035 | 33 | 21.8 |
Alabama | Congressional District 3 (115th Congress), Alabama | 46544 | 32.3 | 22.8 |
Alabama | Congressional District 4 (115th Congress), Alabama | 41110 | 17.4 | 17 |
Alabama | Congressional District 5 (115th Congress), Alabama | 51690 | 31.3 | 30.3 |
... (430 rows omitted)
px.scatter(demographics.to_df(),
           x="College%",
           y="Median Income",
           color="State")