In [1]:
from datascience import *
import numpy as np
In [2]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In this lecture, I am going to use more interactive plots (they look better), so I am importing the plotly.express library. We won't test you on this, but it's good to know.

In [3]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Lecture 31¶

In this lecture, we will explore the use of optimization to find the "best" model and derive least-squares regression.

Review From Last Lecture¶

In the last lecture, we derived the equations for the slope and intercept of the regression line using the correlation coefficient $r$.

Standard Units¶

$$ \text{StandardUnits}(x) = \frac{x - \text{Mean}(x)}{\text{Stdev}(x)} $$
In [4]:
def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)
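As a quick sanity check (this cell is not in the original notebook), an array converted to standard units should have mean 0 and standard deviation 1:

In [ ]:
su = standard_units(make_array(1, 2, 3, 4, 5))
print("mean:", np.mean(su))   # essentially 0 (up to floating-point error)
print("sd:", np.std(su))      # essentially 1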

Correlation¶

$$ \begin{align} r & = \text{Mean}\left(\text{StandardUnits}(x) * \text{StandardUnits}(y)\right)\\ & = \frac{1}{n} \sum_{i=1}^n \text{StandardUnits}(x_i) * \text{StandardUnits}(y_i)\\ & = \frac{1}{n}\sum_{i=1}^n \left( \frac{x_i - \text{Mean}(x)}{\text{Stdev}(x)} \right) * \left( \frac{y_i - \text{Mean}(y)}{\text{Stdev}(y)} \right) \\ \end{align} $$
In [5]:
def correlation(t, x, y):
    """Computes the correlation between columns x and y"""
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.mean(x_su * y_su)

Slope and Intercept¶

$$ \begin{align} \text{slope} &= r * \frac{\text{Stdev}(y)}{\text{Stdev}(x)}\\ \text{intercept} & = \text{Mean}(y) - \text{slope} * \text{Mean}(x) \end{align} $$
In [6]:
def slope(t, x, y):
    """Computes the slope of the regression line"""
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd
In [7]:
def intercept(t, x, y):
    """Computes the intercept of the regression line"""
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y)*x_mean

Linear Prediction¶

$$ y_\text{predicted} = \text{slope} * x + \text{intercept} $$
In [8]:
def predict_linear(t, x, y):
    """Return an array of the regressions estimates at all the x values"""
    pred_y = slope(t, x, y) * t.column(x) + intercept(t, x, y)
    return pred_y
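As a quick check (a sketch that is not part of the original notebook), on perfectly linear data $y = 3x + 2$ these helpers should recover the slope and intercept almost exactly:

In [ ]:
# Hypothetical tiny table used only to sanity-check the helper functions
tiny = Table().with_columns(
    "x", np.arange(10),
    "y", 3 * np.arange(10) + 2
)
print("slope:", slope(tiny, "x", "y"))          # expect approximately 3
print("intercept:", intercept(tiny, "x", "y"))  # expect approximately 2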





Making Predictions with Linear Regression¶

We can now compute predictions, but how good are they? How do we know that we have a good linear fit? To study this we will consider a new dataset.

In [9]:
demographics = Table.read_table('district_demographics2016.csv')
demographics.show(5)
State | District | Median Income | Percent voting for Clinton | College%
Alabama | Congressional District 1 (115th Congress), Alabama | 47083 | 34.1 | 24
Alabama | Congressional District 2 (115th Congress), Alabama | 42035 | 33 | 21.8
Alabama | Congressional District 3 (115th Congress), Alabama | 46544 | 32.3 | 22.8
Alabama | Congressional District 4 (115th Congress), Alabama | 41110 | 17.4 | 17
Alabama | Congressional District 5 (115th Congress), Alabama | 51690 | 31.3 | 30.3

... (430 rows omitted)

In [10]:
px.scatter(demographics.to_df(), 
           x="College%", 
           y="Median Income",
           color="State")
In [11]:
correlation(demographics, 'College%', 'Median Income')
Out[11]:
0.81846485171413352

Discussion Question: Any concerns about the correlation computation being done here?




Making Predictions¶

Here we will try to predict the median income of each district as a function of the percentage of college-educated people.

In [12]:
demo_slope = slope(demographics, 'College%', 'Median Income')
demo_intercept = intercept(demographics, 'College%', 'Median Income')
print("Slope:", demo_slope)
print("Intercept:", demo_intercept)
Slope: 1270.70168946
Intercept: 20802.5777667

Make the actual predictions.

In [13]:
demographics = demographics.with_column(
    "Linear Prediction", predict_linear(demographics, 'College%', 'Median Income')
)
demographics
Out[13]:
State | District | Median Income | Percent voting for Clinton | College% | Linear Prediction
Alabama | Congressional District 1 (115th Congress), Alabama | 47083 | 34.1 | 24 | 51299.4
Alabama | Congressional District 2 (115th Congress), Alabama | 42035 | 33 | 21.8 | 48503.9
Alabama | Congressional District 3 (115th Congress), Alabama | 46544 | 32.3 | 22.8 | 49774.6
Alabama | Congressional District 4 (115th Congress), Alabama | 41110 | 17.4 | 17 | 42404.5
Alabama | Congressional District 5 (115th Congress), Alabama | 51690 | 31.3 | 30.3 | 59304.8
Alabama | Congressional District 6 (115th Congress), Alabama | 61413 | 26.1 | 36.7 | 67437.3
Alabama | Congressional District 7 (115th Congress), Alabama | 34664 | 69.8 | 19.4 | 45454.2
Alaska | Congressional District (at Large) (115th Congress), Alaska | 76440 | 37.6 | 29.6 | 58415.3
Arizona | Congressional District 1 (115th Congress), Arizona | 50537 | 46.6 | 24.5 | 51934.8
Arizona | Congressional District 2 (115th Congress), Arizona | 49072 | 49.6 | 34 | 64006.4

... (425 rows omitted)

Visualizing the predictions:

In [14]:
demographics.iscatter('College%', ["Median Income", "Linear Prediction"])
In [15]:
fig = px.scatter(demographics.to_df(), x="College%", y="Median Income")
xtest = np.arange(0, 75, 1)
fig.add_scatter(x=xtest, 
                y=demo_slope * xtest + demo_intercept,
                name = f"{np.round(demo_slope, 2)} x + {np.round(demo_intercept)}")
fig




Is this a good fit? Is it the best fit?





Return to Slides




Computing the Error¶

The error is the difference between the actual and predicted value:

$$ \text{error} = y - y_\text{predicted} $$

In a future lecture, we will refer to this error as the residual.

In [16]:
y = demographics.column('Median Income')
predicted = predict_linear(demographics, 'College%', 'Median Income')

errors = y - predicted
In [17]:
demographics = demographics.with_column('Error', errors)
demographics
Out[17]:
State | District | Median Income | Percent voting for Clinton | College% | Linear Prediction | Error
Alabama | Congressional District 1 (115th Congress), Alabama | 47083 | 34.1 | 24 | 51299.4 | -4216.42
Alabama | Congressional District 2 (115th Congress), Alabama | 42035 | 33 | 21.8 | 48503.9 | -6468.87
Alabama | Congressional District 3 (115th Congress), Alabama | 46544 | 32.3 | 22.8 | 49774.6 | -3230.58
Alabama | Congressional District 4 (115th Congress), Alabama | 41110 | 17.4 | 17 | 42404.5 | -1294.51
Alabama | Congressional District 5 (115th Congress), Alabama | 51690 | 31.3 | 30.3 | 59304.8 | -7614.84
Alabama | Congressional District 6 (115th Congress), Alabama | 61413 | 26.1 | 36.7 | 67437.3 | -6024.33
Alabama | Congressional District 7 (115th Congress), Alabama | 34664 | 69.8 | 19.4 | 45454.2 | -10790.2
Alaska | Congressional District (at Large) (115th Congress), Alaska | 76440 | 37.6 | 29.6 | 58415.3 | 18024.7
Arizona | Congressional District 1 (115th Congress), Arizona | 50537 | 46.6 | 24.5 | 51934.8 | -1397.77
Arizona | Congressional District 2 (115th Congress), Arizona | 49072 | 49.6 | 34 | 64006.4 | -14934.4

... (425 rows omitted)



What are the districts with the largest error values?

In [18]:
(demographics
     .with_column("Abs Error", np.abs(demographics.column("Error")))
     .sort("Abs Error", descending=True)
)
Out[18]:
State | District | Median Income | Percent voting for Clinton | College% | Linear Prediction | Error | Abs Error
New York | Congressional District 2 (115th Congress), New York | 93362 | 43.9 | 31.2 | 60448.5 | 32913.5 | 32913.5
Maryland | Congressional District 5 (115th Congress), Maryland | 95442 | 64.1 | 33.7 | 63625.2 | 31816.8 | 31816.8
Virginia | Congressional District 10 (115th Congress), Virginia | 120384 | 52.2 | 54.3 | 89801.7 | 30582.3 | 30582.3
California | Congressional District 15 (115th Congress), California | 106291 | 69.9 | 45 | 77984.2 | 28306.8 | 28306.8
California | Congressional District 17 (115th Congress), California | 121150 | 73.9 | 57 | 93232.6 | 27917.4 | 27917.4
California | Congressional District 19 (115th Congress), California | 93139 | 72.9 | 35.5 | 65912.5 | 27226.5 | 27226.5
California | Congressional District 18 (115th Congress), California | 125790 | 73.4 | 61.5 | 98950.7 | 26839.3 | 26839.3
New York | Congressional District 1 (115th Congress), New York | 90435 | 42.2 | 34.1 | 64133.5 | 26301.5 | 26301.5
Pennsylvania | Congressional District 2 (115th Congress), Pennsylvania | 41320 | 72.9 | 36.7 | 67437.3 | -26117.3 | 26117.3
Georgia | Congressional District 5 (115th Congress), Georgia | 48269 | 85 | 42 | 74172 | -25903 | 25903

... (425 rows omitted)

What would a large error suggest?




Visualizing the Errors¶

In [19]:
fig = px.scatter(demographics.to_df(), x="College%", y="Median Income", hover_name="District")
xtest = np.arange(0, 75, 1)
fig.add_scatter(x=xtest, 
                y=demo_slope * xtest + demo_intercept,
                name = f"{np.round(demo_slope, 2)} x + {np.round(demo_intercept)}")
# Draw error segments: each actual value is connected to its prediction,
# and the NaN entries break the line between neighboring districts.
fig.add_scatter(x=demographics.column("College%").repeat(3), 
                y=np.ravel(np.vstack([y, predicted, np.nan * predicted]).T),
                marker_color="gray", line_width=0.75, name="Errors")
fig
In [20]:
demographics.ihist('Error', bins=50, rug=True)





Summarizing the Overall Error¶

What is the average error?

In [21]:
np.mean(errors)
Out[21]:
6.3560089503211536e-13

The average error is essentially zero; the tiny value above is just floating-point error. Least-squares residuals always average to zero, so the mean error tells us nothing about how tight the fit is. Instead, we summarize the size of the errors.

Mean Absolute Error

In [22]:
np.mean(np.abs(errors))
Out[22]:
7235.9076108913441

Mean Squared Error (MSE)

In [23]:
np.mean(errors ** 2)
Out[23]:
88332095.268617392

Root Mean Squared Error (RMSE)

In [24]:
np.sqrt(np.mean(errors ** 2))
Out[24]:
9398.5155885712811



Discussion Question¶

Assuming $y$ is income in dollars, what are the units of:

  1. Mean Absolute Error
  2. Mean Squared Error
  3. Root Mean Squared Error





Error as Function of our Model (Line)¶

In [25]:
def demographics_rmse(slope, intercept):
    predicted = slope * demographics.column("College%") + intercept 
    actual = demographics.column("Median Income")
    errors = predicted - actual
    rmse = np.sqrt(np.mean(errors ** 2))
    return rmse

The value of our error function at the slope and intercept we derived in the last lecture is:

In [26]:
demographics_rmse(demo_slope, demo_intercept)
Out[26]:
9398.5155885712811

This matches the RMSE we computed directly from the errors above. What if we used different slope and intercept values?

In [27]:
def visualize_demographics_rmse(slope, intercept):
    rmse = demographics_rmse(slope, intercept)
    predicted = slope * demographics.column("College%") + intercept 
    actual = demographics.column("Median Income")
    fig = px.scatter(demographics.to_df(), x="College%", y="Median Income")
    xtest = np.arange(0, 75, 1)
    fig.add_scatter(x=xtest, y=slope * xtest + intercept,
                    name = f"{np.round(slope, 2)} x + {np.round(intercept)}")
    # Draw error segments from each actual value to its prediction
    fig.add_scatter(x=demographics.column("College%").repeat(3), 
                    y=np.ravel(np.vstack([actual, predicted, np.nan * predicted]).T),
                    marker_color="gray", line_width=0.75, name="Errors")
    fig.update_layout(title=f"RMSE = {np.round(rmse, 2)}")
    return fig
In [28]:
visualize_demographics_rmse(demo_slope, demo_intercept)
In [29]:
visualize_demographics_rmse(demo_slope+1000, demo_intercept - 50000)





Varying the Slope and Intercept and Plotting the RMSE

In [30]:
alt_slopes = demo_slope + np.arange(-20, 20)
rmses = []
for new_slope in alt_slopes:
    rmses = np.append(rmses, demographics_rmse(new_slope, demo_intercept))

variations = Table().with_columns("Slope", alt_slopes, "RMSE", rmses)
In [31]:
fig = px.scatter(variations.to_df(), x="Slope", y="RMSE")
fig.add_scatter(x=[demo_slope], y=[demographics_rmse(demo_slope, demo_intercept)], marker_size=10, 
                name="Best Slope")

What if we tried to change the intercept value while using the best slope so far?

In [32]:
alt_intercepts = demo_intercept + np.arange(-2000, 2000, 100)
rmses = []
for new_intercept in alt_intercepts:
    rmses = np.append(rmses, demographics_rmse(demo_slope, new_intercept))

variations = Table().with_columns("Intercept", alt_intercepts, "RMSE", rmses)
fig = px.scatter(variations.to_df(), x="Intercept", y="RMSE")
fig.add_scatter(x=[demo_intercept], y=[demographics_rmse(demo_slope, demo_intercept)], marker_size=10, 
                name="Best Intercept")

What if we tried changing both the slope and the intercept at the same time?

In [33]:
alt_slopes = demo_slope + np.arange(-100, 100, 1)
alt_intercepts = demo_intercept + np.arange(-2000, 2000, 10)
variations = Table(["Slope", "Intercept", "RMSE"])
for new_slope in alt_slopes:
    for new_intercept in alt_intercepts:
        rmse = demographics_rmse(new_slope, new_intercept)
        variations.append([new_slope, new_intercept, rmse])
    
variations
go.Figure(data=[
    go.Contour(x=variations.column("Slope"), y=variations.column("Intercept"), z=variations.column("RMSE")), 
    go.Scatter(x=[demo_slope], y=[demo_intercept], marker_color="red")
],
layout=dict(width = 800,height=600, xaxis_title="Slope", yaxis_title="Intercept"))

In three dimensions:

In [34]:
go.Figure(data=[
    go.Surface(x = alt_slopes, y = alt_intercepts,
               z=variations.column("RMSE").reshape(len(alt_slopes), len(alt_intercepts)).T),
    go.Scatter3d(x=[demo_slope], y=[demo_intercept], z=[demographics_rmse(demo_slope, demo_intercept)])], 
          layout=dict(height=1000, 
                      scene_xaxis_title="Slope", scene_yaxis_title="Intercept", 
                      scene_zaxis_title="RMSE"))





Return to Slides




Numerical Optimization¶

If our goal is just to find the parameters of our line that minimize some kind of error, we can use numerical optimization tools. Suppose we wanted to minimize the function:

$$ f(x) = \left(x - 2\right)^2 + 3 $$
In [35]:
def f(x):
    return ((x-2)**2) + 3
In [36]:
x = np.arange(1, 3, 0.1)
y = f(x)
px.line(x=x, y=y)

We can use the minimize function from the datascience package. It returns the argument values at which the function attains its minimum:

$$ \texttt{minimize(f)} \,=\, \arg\min_{a, b, \ldots} f(a, b, \ldots) $$
In [37]:
minimize(f)
Out[37]:
1.9999999946252267
In [38]:
print("x_min =", minimize(f))
print("f(x_min) =", f(minimize(f)))
x_min = 1.9999999946252267
f(x_min) = 3.0
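For intuition only (this brute-force sketch is not how minimize actually works internally), a crude search over a grid of candidate values finds roughly the same minimizer:

In [ ]:
# Evaluate f on a dense grid and keep the x with the smallest value
grid = np.arange(0, 4, 0.001)
grid[np.argmin(f(grid))]   # approximately 2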
In [39]:
fig = px.line(x=x, y=y)
fig.add_scatter(x=[minimize(f)], y=[f(minimize(f))],
                name="Minimum", marker_color="red", marker_size=10)

Minimize works for even more complex functions.

$$ f(x) = 2 * \sin(\pi x) + x^3 + x^4 + \sin(10x) $$
In [40]:
def complicated_function(x):
    return 2 * np.sin(x*np.pi) + x ** 3 + x ** 4 + np.sin(x * 10)
In [41]:
x = np.arange(-1.5, 1.5, 0.01)
y2 = complicated_function(x)
px.line(x=x, y=y2)

We can still use minimize to find the minimum (although, for a function with several local minima like this one, a numerical optimizer is not guaranteed to find the global minimum):

In [42]:
x_min = minimize(complicated_function)
print("x_min =", x_min)
print("f(x_min) =", complicated_function(x_min))
x_min = -0.740999069950806
f(x_min) = -2.46205656866
In [43]:
fig = px.line(x=x, y=y2)
fig.add_scatter(x=[x_min],
                y=[complicated_function(x_min)],
                name="Minimum", marker_color="red", marker_size=10)

We can even minimize multidimensional functions:

$$ \texttt{surface\_function}(a,b) = -\frac{\cos\left(\pi \sqrt{(a+0.5)^2 + b^2}\right)}{(a+0.5)^2 + b^2 + 1} $$
In [44]:
def surface_function(a, b):
    d = np.sqrt( (a+0.5)**2 + b**2 )
    return -np.cos(np.pi* d) / (d**2 + 1)
In [45]:
a_min, b_min = minimize(surface_function)
[a_min, b_min]
Out[45]:
[-0.49999999726226257, 2.7824593980664285e-11]
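This matches the analytic minimum: at $(a, b) = (-0.5, 0)$ the distance term is zero, so the function value is $-\cos(0)/1 = -1$. A quick check (not in the original notebook):

In [ ]:
surface_function(-0.5, 0)   # -1.0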

Visualizing the surface and the minimum:

In [46]:
xs = np.arange(-1.5, 1.5, 0.01)
ys = np.arange(-1.5, 1.5, 0.01)
x, y = np.meshgrid(xs, ys)
zs = surface_function(x.flatten(), y.flatten())
go.Figure(data=[
    go.Surface(x = xs, y = ys,
               z=zs.reshape(len(xs), len(ys))),
    go.Scatter3d(x=[a_min], y=[b_min], z=[surface_function(a_min, b_min)])
    ], 
    layout=dict(height=1000, 
                scene_xaxis_title="a", scene_yaxis_title="b", 
                scene_zaxis_title="surface"))





Minimizing RMSE¶

We can use minimize to find the slope and intercept that minimize the root mean squared error of our predictions:

In [47]:
minimize(demographics_rmse)
Out[47]:
array([  1270.70168805,  20802.57933807])

How does this compare to the slope and intercept we derived earlier?

In [48]:
[demo_slope, demo_intercept]
Out[48]:
[1270.70168946388, 20802.577766677925]

What happens if we minimize the mean squared error instead of the root mean squared error?

In [49]:
def demographics_mse(slope, intercept):
    x = demographics.column('College%')
    y = demographics.column('Median Income')
    estimate = slope*x + intercept
    return np.mean(((y - estimate) ** 2))
In [50]:
minimize(demographics_mse)
Out[50]:
array([  1270.70168947,  20802.5777658 ])

Since the square root is a monotonically increasing function, minimizing the mean squared error gives (up to numerical precision) the same slope and intercept as minimizing the root mean squared error. What if we wanted to use the absolute error instead?

In [51]:
def demographics_mae(any_slope, any_intercept):
    x = demographics.column('College%')
    y = demographics.column('Median Income')
    estimate = any_slope*x + any_intercept
    return np.mean(np.abs(y - estimate))
In [52]:
minimize(demographics_mae)
Out[52]:
array([  1264.71948402,  20001.85902309])

That is different! Squared error penalizes large errors much more heavily than absolute error, so the least-squares line is pulled further toward the outlying high-income districts than the MAE-minimizing line.

In [53]:
mae_slope, mae_intercept = minimize(demographics_mae)
fig = px.scatter(demographics.to_df(), x="College%", y="Median Income", color="State")
xtest = np.arange(0, 75, 0.1)
fig.add_scatter(x=xtest, 
                y=demo_slope * xtest + demo_intercept,
                name = f"Least Squares: {np.round(demo_slope, 2)} x + {np.round(demo_intercept)}")
fig.add_scatter(x=xtest, 
                y=mae_slope * xtest + mae_intercept,
                name = f"MAE: {np.round(mae_slope, 2)} x + {np.round(mae_intercept)}")
fig





Return to Slides








Multiple Linear Regression¶

We can also use multiple variables to help predict a single variable.

In [54]:
hybrid = Table.read_table('hybrid.csv')
hybrid.show(5)
vehicle | year | msrp | acceleration | mpg | class
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact

... (148 rows omitted)

In [55]:
px.scatter_3d(
    hybrid.to_df(), 
    x="mpg", y="acceleration", z="msrp",
    hover_name="vehicle", 
    color="class", 
    height=800
)

Suppose we use the model:

$$ y = a * \text{acc} + b * \text{mpg} + c $$
In [56]:
def hybrid_rmse(a, b, c):
    actual = hybrid.column("msrp")
    acc = hybrid.column("acceleration")
    mpg = hybrid.column("mpg")
    predicted = a*acc + b*mpg + c
    rmse = np.sqrt(np.mean((actual - predicted)**2))
    return rmse
In [57]:
a, b, c = minimize(hybrid_rmse)
[a, b, c]
Out[57]:
[4119.6597287965033, -513.26014324389826, 7636.7196373778297]
In [58]:
print(f"Error: {hybrid_rmse(a, b, c):,}")
Error: 14,687.539382720333
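Once we have the fitted parameters, a prediction is just a matter of plugging values into the model. For example (the acceleration and mpg values below are hypothetical, chosen only for illustration):

In [ ]:
# Predicted MSRP for a hypothetical hybrid with acceleration 10 and 45 mpg
a * 10 + b * 45 + c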
In [59]:
mpg_range = np.arange(10, 80)
acceleration_range = np.arange(5, 25)
predictions = Table(["mpg", "acc", "pred"])
for mpg in mpg_range:
    for acc in acceleration_range: 
        pred = a * acc + b * mpg + c
        predictions.append([mpg, acc, pred])
In [60]:
fig = px.scatter_3d(
    hybrid.to_df(), 
    x="mpg", y="acceleration", z="msrp",
    hover_name="vehicle", 
    color="class", 
    height=800
)
fig.add_surface(
    x = mpg_range, y = acceleration_range,
    z = predictions.column("pred").reshape(len(mpg_range), len(acceleration_range)).T,
    showscale=False
)

Fitting Non-linear Data¶

We could try to improve our predictions by defining a more complex equation:

$$ \text{y} = a * \text{acc} + b *\text{mpg} + c * \text{acc}^2 + d * \text{mpg}^2 + e $$
In [61]:
def hybrid_better_rmse(a, b, c, d, e):
    actual = hybrid.column("msrp")
    acc = hybrid.column("acceleration")
    mpg = hybrid.column("mpg")
    predicted = a*acc + b*mpg + c*acc**2 + d*mpg**2 + e
    rmse = np.sqrt(np.mean((actual - predicted)**2))
    return rmse
In [62]:
a, b, c, d, e = minimize(hybrid_better_rmse)
[a, b, c, d, e]
Out[62]:
[-8278.7149344607205,
 -2634.1314044379506,
 467.28859068643095,
 27.309648987149739,
 122783.24092048552]
In [63]:
print(f"Error: {hybrid_better_rmse(a,b,c,d,e):,}")
Error: 13,035.853740271548
In [64]:
mpg_range = np.arange(10, 80)
acceleration_range = np.arange(5, 25)
predictions = Table(["mpg", "acc", "pred"])
for mpg in mpg_range:
    for acc in acceleration_range: 
        pred = a*acc + b*mpg + c*acc**2 + d*mpg**2 + e
        predictions.append([mpg, acc, pred])
        
fig = px.scatter_3d(
    hybrid.to_df(), 
    x="mpg", y="acceleration", z="msrp",
    hover_name="vehicle", 
    color="class", 
    height=800
)
fig.add_surface(
    x = mpg_range, y = acceleration_range,
    z = predictions.column("pred").reshape(len(mpg_range), len(acceleration_range)).T,
    showscale=False
)