In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Lecture 10

In today's lecture, we will:

  1. Review functions and applying functions to tables by building a simple but sophisticated prediction function.
  2. Introduce the group operation.






Prediction

Can we predict how tall a child will grow based on the height of their parents?

To do this we will use Galton's famous height dataset, which was collected to demonstrate the connection between parents' heights and the heights of their children.

In [2]:
families = Table.read_table('family_heights.csv')
families
Out[2]:
family father mother child children order sex
1 78.5 67 73.2 4 1 male
1 78.5 67 69.2 4 2 female
1 78.5 67 69 4 3 female
1 78.5 67 69 4 4 female
2 75.5 66.5 73.5 4 1 male
2 75.5 66.5 72.5 4 2 male
2 75.5 66.5 65.5 4 3 female
2 75.5 66.5 65.5 4 4 female
3 75 64 71 2 1 male
3 75 64 68 2 2 female

... (924 rows omitted)




Discussion: This data was collected for Europeans living in the late 1800s. What are some of the potential issues with this data?







Exploring the Data

Exercise: Add a column "parent average" containing the average height of both parents.

In [3]:
families = families.with_column(
    "parent average", (families.column('father') + families.column('mother'))/2.0
)
families
Out[3]:
family father mother child children order sex parent average
1 78.5 67 73.2 4 1 male 72.75
1 78.5 67 69.2 4 2 female 72.75
1 78.5 67 69 4 3 female 72.75
1 78.5 67 69 4 4 female 72.75
2 75.5 66.5 73.5 4 1 male 71
2 75.5 66.5 72.5 4 2 male 71
2 75.5 66.5 65.5 4 3 female 71
2 75.5 66.5 65.5 4 4 female 71
3 75 64 71 2 1 male 69.5
3 75 64 68 2 2 female 69.5

... (924 rows omitted)











What is the relationship between a child's height and the parents' average height?

Exercise: Make a scatter plot showing the relationship between the "parent average" and the "child" height.

In [4]:
families.scatter("parent average", "child")



Questions:

  1. Do we observe a relationship between child and parent height?
  2. Would a line plot help reveal that relationship?








Making a Prediction

If we wanted to predict the height of a child given the heights of the parents, we could look at the heights of children whose parents have a similar average height.

In [5]:
my_height = 5*12 + 11 # 5 ft 11 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches
In [6]:
our_average = (my_height + spouse_height) / 2.0
our_average
Out[6]:
69.0

Let's look at parents whose average height is within 1 inch of ours.

In [7]:
window = 1 
lower_bound = our_average - window
upper_bound = our_average + window
In [8]:
families.scatter('parent average', 'child')
# You don't need to know the details of this plotting code yet.
plots.plot([lower_bound, lower_bound], [50, 85], color='red', lw=2)
plots.plot([our_average, our_average], [50, 85], color='orange', lw=2);
plots.plot([upper_bound, upper_bound], [50, 85], color='red', lw=2);





Exercise: Create a function that takes the average of the parents' heights and returns an array of the heights of all children whose parents' average height is within the window of that value.

In [9]:
def similar_child_heights(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return (
        families
            .where("parent average", are.between(lower_bound, upper_bound))
            .column("child")
    )



Testing the function:

In [10]:
# window = 1.0
similar_child_heights(our_average)
Out[10]:
array([ 71. ,  68. ,  70.5,  68.5,  67. ,  64.5,  63. ,  65.5,  74. ,
        70. ,  68. ,  67. ,  67. ,  66. ,  63.5,  63. ,  71. ,  70.5,
        66.7,  72. ,  70.5,  70.2,  70.2,  69.2,  68.7,  66.5,  64.5,
        63.5,  74. ,  73. ,  71.5,  62.5,  66.5,  62.3,  66. ,  64.5,
        64. ,  62.7,  73. ,  71. ,  67. ,  74.2,  70.5,  69.5,  66. ,
        65.5,  65. ,  65. ,  65.5,  66. ,  63. ,  67.5,  67.2,  66.7,
        73.2,  73. ,  69. ,  67. ,  70. ,  67. ,  67. ,  66.5,  70. ,
        69. ,  68.5,  66. ,  64.5,  63. ,  71. ,  67. ,  76. ,  72. ,
        71. ,  66. ,  66. ,  70.5,  72. ,  72. ,  71. ,  69. ,  66. ,
        65. ,  73. ,  65.2,  68.5,  67.7,  68. ,  68. ,  62. ,  72. ,
        71. ,  70.5,  67. ,  72. ,  71. ,  70. ,  66. ,  64.5,  64.5,
        62. ,  71. ,  70. ,  69. ,  69. ,  70. ,  68.7,  68. ,  66. ,
        64. ,  62. ,  75. ,  70. ,  69. ,  66. ,  64. ,  60. ,  67.5,
        73. ,  72. ,  72. ,  66.5,  69.2,  67.2,  66.5,  66. ,  66. ,
        64.2,  63.7,  75. ,  71. ,  70. ,  66. ,  66. ,  65.5,  65. ,
        65. ,  64. ,  64. ,  64. ,  70.5,  67.5,  64.5,  64. ,  71. ,  61.7])
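The filtering step inside similar_child_heights can also be written with a boolean mask in plain NumPy. The sketch below is only an illustration: the two arrays are hypothetical toy values standing in for the "parent average" and "child" columns, and the name similar_child_heights_np is made up here. It assumes that are.between includes the lower bound and excludes the upper bound, which is how the datascience documentation describes it.

```python
import numpy as np

# Hypothetical toy data standing in for two columns of the families table.
parent_average = np.array([72.75, 71.0, 69.5, 69.0, 68.2, 70.0, 66.0])
child = np.array([73.2, 65.5, 71.0, 68.0, 70.5, 66.0, 64.0])

def similar_child_heights_np(target, window=1):
    """Return child heights whose parent average is in [target - window, target + window)."""
    lower_bound = target - window
    upper_bound = target + window
    # Boolean mask: True for rows that fall inside the window.
    # Assumption: lower bound included, upper bound excluded, like are.between.
    in_window = (parent_average >= lower_bound) & (parent_average < upper_bound)
    return child[in_window]

nearby = similar_child_heights_np(69.0)
print(nearby)              # the heights selected by the mask
print(np.average(nearby))  # averaging them gives the prediction
```

With a target of 69.0 the window is [68, 70), so the row with a parent average of exactly 70.0 is left out.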





Exercise: Create a function that predicts a child's height as the average height of the children whose parents' average height is within the window of the given parent average.

In [11]:
def predict_child_height(parent_average):
    return np.average(similar_child_heights(parent_average))



In [12]:
predict_child_height(our_average)
Out[12]:
67.799310344827589




Let's plot the predicted height as well as the distribution of children's heights:

In [13]:
# window = 1.0
similar = similar_child_heights(our_average)
predicted_height = predict_child_height(our_average)

print("Mean:", predicted_height)
Table().with_column("child", similar).hist("child", bins=20)
plots.plot([predicted_height, predicted_height], [0, .1], color="red")
Mean: 67.7993103448
Out[13]:
[<matplotlib.lines.Line2D at 0x14eebdea0>]

Discussion: Is this a good predictor? How would I know?








Evaluating the Predictions

To evaluate the predictions, let's see how the predictions compare to the actual heights of all the children in our dataset.

Exercise: Apply the function (using apply) to all the parent averages in the table and save the result to the "predicted" column.

In [14]:
# window = 0.5
families = families.with_column(
    "predicted", families.apply(predict_child_height, "parent average"))
families
Out[14]:
family father mother child children order sex parent average predicted
1 78.5 67 73.2 4 1 male 72.75 70.1
1 78.5 67 69.2 4 2 female 72.75 70.1
1 78.5 67 69 4 3 female 72.75 70.1
1 78.5 67 69 4 4 female 72.75 70.1
2 75.5 66.5 73.5 4 1 male 71 69.9971
2 75.5 66.5 72.5 4 2 male 71 69.9971
2 75.5 66.5 65.5 4 3 female 71 69.9971
2 75.5 66.5 65.5 4 4 female 71 69.9971
3 75 64 71 2 1 male 69.5 68.2092
3 75 64 68 2 2 female 69.5 68.2092

... (924 rows omitted)
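Conceptually, apply calls the function once per row, passing in the value from each named column, and collects the results into an array. The sketch below mimics that behavior in plain NumPy. It is not the datascience implementation: apply_sketch and toy_predict are hypothetical names, and toy_predict is just a stand-in so the example runs without the families table.

```python
import numpy as np

def apply_sketch(function, *columns):
    """Call `function` once per row, with one argument taken from each column."""
    return np.array([function(*row) for row in zip(*columns)])

# Hypothetical stand-in for predict_child_height (not fitted to any data).
def toy_predict(parent_average):
    return 0.5 * parent_average + 33.0

def error(predicted, true_value):
    return predicted - true_value

parent_averages = np.array([72.75, 71.0, 69.5])
children = np.array([73.2, 73.5, 71.0])

predicted = apply_sketch(toy_predict, parent_averages)  # one column, one argument
errors = apply_sketch(error, predicted, children)       # two columns, two arguments
print(predicted)
print(errors)
```

The second call shows why a function of two arguments needs two column names: each row supplies one value per column.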








Exercise: Construct a scatter plot with the "parent average" height on the x-axis and both the "child" height and the "predicted" height on the y-axis.

In [15]:
(
    families
    .select('parent average','child', 'predicted')
    .scatter('parent average')
)



Discussion: What do we see in this plot? What trends stand out?





Exercise: Define a function to compute the error (the difference) between the predicted value and the true value and apply that function to the table adding a column containing the "error". Then construct a histogram of the errors.

In [16]:
def error(predicted, true_value):
    return predicted - true_value

families = families.with_column(
    "error", families.apply(error, "predicted", "child"))
families
Out[16]:
family father mother child children order sex parent average predicted error
1 78.5 67 73.2 4 1 male 72.75 70.1 -3.1
1 78.5 67 69.2 4 2 female 72.75 70.1 0.9
1 78.5 67 69 4 3 female 72.75 70.1 1.1
1 78.5 67 69 4 4 female 72.75 70.1 1.1
2 75.5 66.5 73.5 4 1 male 71 69.9971 -3.50286
2 75.5 66.5 72.5 4 2 male 71 69.9971 -2.50286
2 75.5 66.5 65.5 4 3 female 71 69.9971 4.49714
2 75.5 66.5 65.5 4 4 female 71 69.9971 4.49714
3 75 64 71 2 1 male 69.5 68.2092 -2.79083
3 75 64 68 2 2 female 69.5 68.2092 0.209174

... (924 rows omitted)




Visualizing the distribution of the errors:

In [17]:
families.hist('error')

Discussion: Is this good?
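One way to make this question concrete is to summarize the errors with a few numbers in addition to the histogram. The sketch below computes the mean error, the mean absolute error, and the root mean squared error. These particular summaries are a suggestion rather than something the lecture prescribes, and the arrays hold only the ten rows displayed in the table above, so the results say nothing reliable about the full dataset.

```python
import numpy as np

# The ten rows displayed above (not the full table).
predicted = np.array([70.1] * 4 + [69.9971] * 4 + [68.2092] * 2)
child = np.array([73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0])

errors = predicted - child

mean_error = np.average(errors)           # near 0 when over- and under-predictions cancel
mean_abs_error = np.average(np.abs(errors))  # typical size of a miss, in inches
rmse = np.sqrt(np.average(errors ** 2))   # like the above, but penalizes large misses more

print("Mean error:", mean_error)
print("Mean absolute error:", mean_abs_error)
print("Root mean squared error:", rmse)
```

A mean error close to zero alongside a mean absolute error of a couple of inches would suggest the predictor is roughly unbiased overall but often misses individual children by a noticeable amount.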










Exercise: Overlay the histograms of the error for male and female children.

In [18]:
families.hist('error', group='sex')



Discussion: What do we observe?








Building a Better Predictor

Based on what we observed, let's build a better predictor.

Exercise: Implement a new height prediction function that averages the heights of children of the same sex whose parents had a similar average height.

Hint: Here is the previous predictor, with the filtering and averaging steps combined into one function:

def predict_child_height(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return np.average(
        families
            .where("parent average", are.between(lower_bound, upper_bound))
            .column("child")
    )
In [19]:
def predict_child_height_with_sex(parent_average, sex):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return np.average(
        families
        .where("sex", sex)
        .where("parent average", are.between(lower_bound, upper_bound))
        .column("child")
    )



Let's test it out.

In [20]:
predict_child_height_with_sex(our_average, "male")
Out[20]:
70.640298507462674
In [21]:
predict_child_height_with_sex(our_average, "female")
Out[21]:
65.358974358974365






Exercise: Apply the better predictor to the table and save the predictions in a column called "predicted with sex".

In [22]:
families = families.with_column(
    "predicted with sex", families.apply(predict_child_height_with_sex, "parent average", "sex"))
families
Out[22]:
family father mother child children order sex parent average predicted error predicted with sex
1 78.5 67 73.2 4 1 male 72.75 70.1 -3.1 73.2
1 78.5 67 69.2 4 2 female 72.75 70.1 0.9 69.0667
1 78.5 67 69 4 3 female 72.75 70.1 1.1 69.0667
1 78.5 67 69 4 4 female 72.75 70.1 1.1 69.0667
2 75.5 66.5 73.5 4 1 male 71 69.9971 -3.50286 72.7882
2 75.5 66.5 72.5 4 2 male 71 69.9971 -2.50286 72.7882
2 75.5 66.5 65.5 4 3 female 71 69.9971 4.49714 67.3611
2 75.5 66.5 65.5 4 4 female 71 69.9971 4.49714 67.3611
3 75 64 71 2 1 male 69.5 68.2092 -2.79083 70.9566
3 75 64 68 2 2 female 69.5 68.2092 0.209174 65.6089

... (924 rows omitted)








Exercise: Construct a histogram of the new errors broken down by the sex of the child.

In [23]:
families = families.with_column("error with sex", 
                                families.apply(error, "predicted with sex", "child"))

families.hist("error with sex", group="sex")

As a point of comparison, here are the errors of the original predictor:

In [24]:
families.hist("error", group="sex")
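To put rough numbers next to the two histograms, the mean error for each sex can be computed under both predictors. The sketch below does this for the ten rows displayed in the tables above only, so it is an illustration rather than an evaluation of the full dataset. The helper name mean_error_by_sex is made up for this example.

```python
import numpy as np

# The ten rows displayed in the tables above (not the full table).
child = np.array([73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0])
sex = np.array(["male", "female", "female", "female", "male",
                "male", "female", "female", "male", "female"])
predicted = np.array([70.1] * 4 + [69.9971] * 4 + [68.2092] * 2)
predicted_with_sex = np.array([73.2, 69.0667, 69.0667, 69.0667, 72.7882,
                               72.7882, 67.3611, 67.3611, 70.9566, 65.6089])

def mean_error_by_sex(predictions):
    """Average (prediction - actual) separately for male and female children."""
    errors = predictions - child
    return {label: np.average(errors[sex == label]) for label in ("male", "female")}

before = mean_error_by_sex(predicted)
after = mean_error_by_sex(predicted_with_sex)
print("Original predictor:", before)
print("Predictor with sex:", after)
```

On these rows the original predictor is too low for male children and too high for female children, while the mean errors of the sex-aware predictor are much closer to zero. Note that the first row's new prediction equals that child's own height, a reminder that each child is included in the data used to predict them.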








Return to slides


Grouping

For this part of the notebook we will use the following toy data:

In [25]:
cones = Table.read_table('cones.csv')
cones
Out[25]:
Flavor Color Price Rating
strawberry pink 3.55 1
chocolate light brown 4.75 4
chocolate dark brown 5.25 3
strawberry pink 5.25 2
chocolate dark brown 5.25 5
bubblegum pink 4.75 1





Exercise: Use the group function to determine the number of cones with each flavor.

In [26]:
cones.group('Flavor')
Out[26]:
Flavor count
bubblegum 1
chocolate 3
strawberry 2







Exercise: Use the group function to compute the average price of cones for each flavor.

In [27]:
cones.group('Flavor', np.average)
Out[27]:
Flavor     | Color average | Price average | Rating average
bubblegum  |               | 4.75          | 1
chocolate  |               | 5.08333       | 4
strawberry |               | 4.4           | 1.5







Exercise: Use the group function to compute the minimum price of cones for each flavor.

In [28]:
cones.group('Flavor', np.min)
Out[28]:
Flavor     | Color amin | Price amin | Rating amin
bubblegum  |            | 4.75       | 1
chocolate  |            | 4.75       | 3
strawberry |            | 3.55       | 1



What is really going on:

In [29]:
cones
Out[29]:
Flavor Color Price Rating
strawberry pink 3.55 1
chocolate light brown 4.75 4
chocolate dark brown 5.25 3
strawberry pink 5.25 2
chocolate dark brown 5.25 5
bubblegum pink 4.75 1
In [30]:
def my_grp(grp):
    print(grp)
    return grp

cones.group("Flavor", my_grp)
['pink']
['light brown' 'dark brown' 'dark brown']
['pink' 'pink']
[ 4.75]
[ 4.75  5.25  5.25]
[ 3.55  5.25]
[1]
[4 3 5]
[1 2]
Out[30]:
Flavor Color my_grp Price my_grp Rating my_grp
bubblegum ['pink'] [ 4.75] [1]
chocolate ['light brown' 'dark brown' 'dark brown'] [ 4.75 5.25 5.25] [4 3 5]
strawberry ['pink' 'pink'] [ 3.55 5.25] [1 2]
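The printed output suggests that group first gathers, for each flavor, an array of the values in every other column, and then passes each array to the collection function. The sketch below imitates that for a single column in plain Python and NumPy. It is not the datascience implementation: group_sketch is a hypothetical name, and sorting the labels is an assumption made to match the alphabetical order in the output above.

```python
import numpy as np

# The Flavor and Price columns of the cones table.
flavors = ["strawberry", "chocolate", "chocolate", "strawberry", "chocolate", "bubblegum"]
prices = [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]

def group_sketch(labels, values, collect=len):
    """Gather the values for each label, then summarize each group with `collect`."""
    groups = {}
    for label, value in zip(labels, values):
        groups.setdefault(label, []).append(value)
    # Each group is handed to `collect` as an array, one group at a time.
    return {label: collect(np.array(groups[label])) for label in sorted(groups)}

counts = group_sketch(flavors, prices)                    # default: count rows per group
average_price = group_sketch(flavors, prices, np.average)
min_price = group_sketch(flavors, prices, np.min)

print(counts)
print(average_price)
print(min_price)
```

Passing no function counts the rows in each group, which mirrors cones.group('Flavor'), while passing np.average or np.min summarizes each group's array in the same way as the exercises above.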