from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In today's lecture, we will:
Can we predict how tall a child will grow based on the height of their parents?
To do this, we will use Galton's famous height dataset, which was collected to demonstrate the connection between parents' heights and the heights of their children.
families = Table.read_table('family_heights.csv')
families
family | father | mother | child | children | order | sex |
---|---|---|---|---|---|---|
1 | 78.5 | 67 | 73.2 | 4 | 1 | male |
1 | 78.5 | 67 | 69.2 | 4 | 2 | female |
1 | 78.5 | 67 | 69 | 4 | 3 | female |
1 | 78.5 | 67 | 69 | 4 | 4 | female |
2 | 75.5 | 66.5 | 73.5 | 4 | 1 | male |
2 | 75.5 | 66.5 | 72.5 | 4 | 2 | male |
2 | 75.5 | 66.5 | 65.5 | 4 | 3 | female |
2 | 75.5 | 66.5 | 65.5 | 4 | 4 | female |
3 | 75 | 64 | 71 | 2 | 1 | male |
3 | 75 | 64 | 68 | 2 | 2 | female |
... (924 rows omitted)
Discussion: This data was collected for Europeans living in the late 1800s. What are some of the potential issues with this data?
Exercise: Add a column "parent average" containing the average height of both parents.
families = families.with_column(
"parent average", (families.column('father') + families.column('mother'))/2.0
)
families
family | father | mother | child | children | order | sex | parent average |
---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 73.2 | 4 | 1 | male | 72.75 |
1 | 78.5 | 67 | 69.2 | 4 | 2 | female | 72.75 |
1 | 78.5 | 67 | 69 | 4 | 3 | female | 72.75 |
1 | 78.5 | 67 | 69 | 4 | 4 | female | 72.75 |
2 | 75.5 | 66.5 | 73.5 | 4 | 1 | male | 71 |
2 | 75.5 | 66.5 | 72.5 | 4 | 2 | male | 71 |
2 | 75.5 | 66.5 | 65.5 | 4 | 3 | female | 71 |
2 | 75.5 | 66.5 | 65.5 | 4 | 4 | female | 71 |
3 | 75 | 64 | 71 | 2 | 1 | male | 69.5 |
3 | 75 | 64 | 68 | 2 | 2 | female | 69.5 |
... (924 rows omitted)
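The column arithmetic above works element by element: each father's height is paired with the mother's height from the same row. A minimal plain-Python sketch of the same computation, using the three distinct families from the preview above (the list names are illustrative and not part of the notebook):

```python
# Heights (inches) of the three distinct families shown in the preview above.
fathers = [78.5, 75.5, 75.0]
mothers = [67.0, 66.5, 64.0]

# Pair each father with the mother in the same position and average the two.
parent_average = [(f + m) / 2.0 for f, m in zip(fathers, mothers)]

print(parent_average)
```

These values agree with the 72.75, 71, and 69.5 shown in the "parent average" column.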
What is the relationship between a child's height and the average height of their parents?
Exercise: Make a scatter plot showing the relationship between the "parent average" and the "child" height.
families.scatter("parent average", "child")
Questions:
If we wanted to predict the height of a child given the heights of the parents, we could look at the heights of children whose parents have a similar average height.
my_height = 5*12 + 11 # 5 ft 11 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches
our_average = (my_height + spouse_height) / 2.0
our_average
69.0
Let's look at parents whose average height is within 1 inch of ours.
window = 1
lower_bound = our_average - window
upper_bound = our_average + window
families.scatter('parent average', 'child')
# You don't need to know the details of this plotting code yet.
plots.plot([lower_bound, lower_bound], [50, 85], color='red', lw=2)
plots.plot([our_average, our_average], [50, 85], color='orange', lw=2);
plots.plot([upper_bound, upper_bound], [50, 85], color='red', lw=2);
Exercise: Create a function that takes the average of the parents' heights and returns an array of the heights of all children whose parents' average height is within the window of that value.
def similar_child_heights(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return (
        families
        .where("parent average", are.between(lower_bound, upper_bound))
        .column("child")
    )
Testing the function:
# window = 1.0
similar_child_heights(our_average)
array([ 71. , 68. , 70.5, 68.5, 67. , 64.5, 63. , 65.5, 74. , 70. , 68. , 67. , 67. , 66. , 63.5, 63. , 71. , 70.5, 66.7, 72. , 70.5, 70.2, 70.2, 69.2, 68.7, 66.5, 64.5, 63.5, 74. , 73. , 71.5, 62.5, 66.5, 62.3, 66. , 64.5, 64. , 62.7, 73. , 71. , 67. , 74.2, 70.5, 69.5, 66. , 65.5, 65. , 65. , 65.5, 66. , 63. , 67.5, 67.2, 66.7, 73.2, 73. , 69. , 67. , 70. , 67. , 67. , 66.5, 70. , 69. , 68.5, 66. , 64.5, 63. , 71. , 67. , 76. , 72. , 71. , 66. , 66. , 70.5, 72. , 72. , 71. , 69. , 66. , 65. , 73. , 65.2, 68.5, 67.7, 68. , 68. , 62. , 72. , 71. , 70.5, 67. , 72. , 71. , 70. , 66. , 64.5, 64.5, 62. , 71. , 70. , 69. , 69. , 70. , 68.7, 68. , 66. , 64. , 62. , 75. , 70. , 69. , 66. , 64. , 60. , 67.5, 73. , 72. , 72. , 66.5, 69.2, 67.2, 66.5, 66. , 66. , 64.2, 63.7, 75. , 71. , 70. , 66. , 66. , 65.5, 65. , 65. , 64. , 64. , 64. , 70.5, 67.5, 64.5, 64. , 71. , 61.7])
Exercise: Create a function to predict the child's height as the average of the height of children within the window of the average parent height.
def predict_child_height(parent_average):
    return np.average(similar_child_heights(parent_average))
predict_child_height(our_average)
67.799310344827589
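The prediction above is a nearest-neighbor average: keep the children whose parents' average falls inside the window, then take their mean. The sketch below restates that idea in plain Python on only the ten rows previewed earlier, so its numbers will not match the full-table result. The names `rows` and `predict` are illustrative, and the half-open interval is an assumption meant to mirror how `are.between` is documented (lower bound included, upper bound excluded).

```python
from statistics import mean

# (parent average, child height) for the ten rows shown in the table preview.
rows = [
    (72.75, 73.2), (72.75, 69.2), (72.75, 69.0), (72.75, 69.0),
    (71.0, 73.5), (71.0, 72.5), (71.0, 65.5), (71.0, 65.5),
    (69.5, 71.0), (69.5, 68.0),
]

def predict(parent_average, window=1.0):
    """Mean child height among rows whose parent average is in the window."""
    lower, upper = parent_average - window, parent_average + window
    # Assumed half-open interval: lower <= value < upper.
    similar = [child for avg, child in rows if lower <= avg < upper]
    # With no neighbors there is nothing to average, so return None.
    return mean(similar) if similar else None

print(predict(71.0))              # uses only the four rows with parent average 71
print(predict(71.0, window=2.0))  # a wider window pulls in all ten rows
print(predict(60.0))              # no rows fall in this window
```

A wider window averages over more children, which smooths the prediction but makes it less specific to the given parents.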
Let's plot the predicted height as well as the distribution of children's heights:
# window = 1.0
similar = similar_child_heights(our_average)
predicted_height = predict_child_height(our_average)
print("Mean:", predicted_height)
Table().with_column("child", similar).hist("child", bins=20)
plots.plot([predicted_height, predicted_height], [0, .1], color="red")
Mean: 67.7993103448
Discussion: Is this a good predictor? How would I know?
To evaluate the predictions, let's see how the predictions compare to the actual heights of all the children in our dataset.
Exercise: Apply the function (using apply) to all the parent averages in the table and save the result to the "predicted" column.
# window = 0.5
families = families.with_column(
"predicted", families.apply(predict_child_height, "parent average"))
families
family | father | mother | child | children | order | sex | parent average | predicted |
---|---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 73.2 | 4 | 1 | male | 72.75 | 70.1 |
1 | 78.5 | 67 | 69.2 | 4 | 2 | female | 72.75 | 70.1 |
1 | 78.5 | 67 | 69 | 4 | 3 | female | 72.75 | 70.1 |
1 | 78.5 | 67 | 69 | 4 | 4 | female | 72.75 | 70.1 |
2 | 75.5 | 66.5 | 73.5 | 4 | 1 | male | 71 | 69.9971 |
2 | 75.5 | 66.5 | 72.5 | 4 | 2 | male | 71 | 69.9971 |
2 | 75.5 | 66.5 | 65.5 | 4 | 3 | female | 71 | 69.9971 |
2 | 75.5 | 66.5 | 65.5 | 4 | 4 | female | 71 | 69.9971 |
3 | 75 | 64 | 71 | 2 | 1 | male | 69.5 | 68.2092 |
3 | 75 | 64 | 68 | 2 | 2 | female | 69.5 | 68.2092 |
... (924 rows omitted)
Exercise: Construct a scatter plot with the "parent average" height on the x-axis and the "child" height and the "predicted" height on the y-axis.
(
families
.select('parent average','child', 'predicted')
.scatter('parent average')
)
Discussion: What do we see in this plot? What trends stand out?
Exercise: Define a function to compute the error (the difference) between the predicted value and the true value and apply that function to the table, adding a column containing the "error". Then construct a histogram of the errors.
def error(predicted, true_value):
    return predicted - true_value
families = families.with_column(
"error", families.apply(error, "predicted", "child"))
families
family | father | mother | child | children | order | sex | parent average | predicted | error |
---|---|---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 73.2 | 4 | 1 | male | 72.75 | 70.1 | -3.1 |
1 | 78.5 | 67 | 69.2 | 4 | 2 | female | 72.75 | 70.1 | 0.9 |
1 | 78.5 | 67 | 69 | 4 | 3 | female | 72.75 | 70.1 | 1.1 |
1 | 78.5 | 67 | 69 | 4 | 4 | female | 72.75 | 70.1 | 1.1 |
2 | 75.5 | 66.5 | 73.5 | 4 | 1 | male | 71 | 69.9971 | -3.50286 |
2 | 75.5 | 66.5 | 72.5 | 4 | 2 | male | 71 | 69.9971 | -2.50286 |
2 | 75.5 | 66.5 | 65.5 | 4 | 3 | female | 71 | 69.9971 | 4.49714 |
2 | 75.5 | 66.5 | 65.5 | 4 | 4 | female | 71 | 69.9971 | 4.49714 |
3 | 75 | 64 | 71 | 2 | 1 | male | 69.5 | 68.2092 | -2.79083 |
3 | 75 | 64 | 68 | 2 | 2 | female | 69.5 | 68.2092 | 0.209174 |
... (924 rows omitted)
Visualizing the distribution of the errors:
families.hist('error')
Discussion: Is this good?
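One way to make "good" concrete is to summarize the errors with a few numbers. The sketch below is an illustration on the ten errors visible in the table preview (not the full column), computing the mean error, the mean absolute error, and the root mean squared error. These particular summaries are a suggestion, not something the notebook prescribes.

```python
from math import sqrt
from statistics import mean

# The ten "error" values shown in the table preview (predicted - child).
errors = [-3.1, 0.9, 1.1, 1.1, -3.50286, -2.50286,
          4.49714, 4.49714, -2.79083, 0.209174]

mean_error = mean(errors)                 # signed: reveals systematic over- or under-prediction
mae = mean(abs(e) for e in errors)        # typical size of a miss, in inches
rmse = sqrt(mean(e * e for e in errors))  # like MAE, but penalizes large misses more

print("mean error:", round(mean_error, 3))
print("MAE:", round(mae, 3))
print("RMSE:", round(rmse, 3))
```

A mean error near zero alongside a large mean absolute error would indicate that the misses cancel out on average while individual predictions are still far off.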
Exercise: Overlay the histograms of the error for male and female children.
families.hist('error', group='sex')
Discussion: What do we observe?
Exercise: Implement a new height prediction function that averages the heights of children of the same sex whose parents had a similar average height.
Hint: Here is the previous prediction function, written as a single definition:
def predict_child_height(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return np.average(
        families
        .where("parent average", are.between(lower_bound, upper_bound))
        .column("child")
    )
def predict_child_height_with_sex(parent_average, sex):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return np.average(
        families
        .where("sex", sex)
        .where("parent average", are.between(lower_bound, upper_bound))
        .column("child")
    )
Let's test it out.
predict_child_height_with_sex(our_average, "male")
70.640298507462674
predict_child_height_with_sex(our_average, "female")
65.358974358974365
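Conditioning on sex just adds one more filter before averaging. A plain-Python sketch of that idea, again restricted to the ten previewed rows (so the outputs differ from the full-table values above). The names `rows` and `predict_with_sex` are illustrative, and the half-open window is an assumption.

```python
from statistics import mean

# (parent average, child height, sex) for the ten rows in the table preview.
rows = [
    (72.75, 73.2, "male"), (72.75, 69.2, "female"),
    (72.75, 69.0, "female"), (72.75, 69.0, "female"),
    (71.0, 73.5, "male"), (71.0, 72.5, "male"),
    (71.0, 65.5, "female"), (71.0, 65.5, "female"),
    (69.5, 71.0, "male"), (69.5, 68.0, "female"),
]

def predict_with_sex(parent_average, sex, window=1.0):
    """Mean height of same-sex children whose parent average is in the window."""
    lower, upper = parent_average - window, parent_average + window
    similar = [
        child
        for avg, child, child_sex in rows
        if child_sex == sex and lower <= avg < upper
    ]
    # With no matching rows there is nothing to average, so return None.
    return mean(similar) if similar else None

print(predict_with_sex(71.0, "male"))
print(predict_with_sex(71.0, "female"))
```

Each extra filter shrinks the pool of neighbors, so for the same window a sex-specific predictor averages over fewer children.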
Exercise: Apply the better predictor to the table and save the predictions in a column called "predicted with sex".
families = families.with_column(
"predicted with sex", families.apply(predict_child_height_with_sex, "parent average", "sex"))
families
family | father | mother | child | children | order | sex | parent average | predicted | error | predicted with sex |
---|---|---|---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 73.2 | 4 | 1 | male | 72.75 | 70.1 | -3.1 | 73.2 |
1 | 78.5 | 67 | 69.2 | 4 | 2 | female | 72.75 | 70.1 | 0.9 | 69.0667 |
1 | 78.5 | 67 | 69 | 4 | 3 | female | 72.75 | 70.1 | 1.1 | 69.0667 |
1 | 78.5 | 67 | 69 | 4 | 4 | female | 72.75 | 70.1 | 1.1 | 69.0667 |
2 | 75.5 | 66.5 | 73.5 | 4 | 1 | male | 71 | 69.9971 | -3.50286 | 72.7882 |
2 | 75.5 | 66.5 | 72.5 | 4 | 2 | male | 71 | 69.9971 | -2.50286 | 72.7882 |
2 | 75.5 | 66.5 | 65.5 | 4 | 3 | female | 71 | 69.9971 | 4.49714 | 67.3611 |
2 | 75.5 | 66.5 | 65.5 | 4 | 4 | female | 71 | 69.9971 | 4.49714 | 67.3611 |
3 | 75 | 64 | 71 | 2 | 1 | male | 69.5 | 68.2092 | -2.79083 | 70.9566 |
3 | 75 | 64 | 68 | 2 | 2 | female | 69.5 | 68.2092 | 0.209174 | 65.6089 |
... (924 rows omitted)
Exercise: Construct a histogram of the new errors broken down by the sex of the child.
families = families.with_column("error with sex",
families.apply(error, "predicted with sex", "child"))
families.hist("error with sex", group="sex")
As a point of comparison, here are the errors of the original predictor, again broken down by sex:
families.hist("error", group="sex")
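The two sets of histograms can also be compared numerically. As a rough illustration, the sketch below computes the mean absolute error of each predictor on the ten previewed rows only. The choice of mean absolute error and the helper name `mean_absolute_error` are assumptions made for this example.

```python
from statistics import mean

# (child, predicted, predicted with sex) for the ten rows in the table preview.
rows = [
    (73.2, 70.1, 73.2), (69.2, 70.1, 69.0667),
    (69.0, 70.1, 69.0667), (69.0, 70.1, 69.0667),
    (73.5, 69.9971, 72.7882), (72.5, 69.9971, 72.7882),
    (65.5, 69.9971, 67.3611), (65.5, 69.9971, 67.3611),
    (71.0, 68.2092, 70.9566), (68.0, 68.2092, 65.6089),
]

def mean_absolute_error(pairs):
    """Average size of (predicted - true), ignoring sign."""
    return mean(abs(predicted - true) for true, predicted in pairs)

mae_plain = mean_absolute_error((child, p) for child, p, _ in rows)
mae_with_sex = mean_absolute_error((child, p) for child, _, p in rows)

print("MAE, parent average only:", round(mae_plain, 2))
print("MAE, parent average and sex:", round(mae_with_sex, 2))
```

Both predictors are scored here on rows that were also used to build them, so these numbers are likely optimistic about how well either would do on new families.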
For this part of the notebook we will use the following toy data:
cones = Table.read_table('cones.csv')
cones
Flavor | Color | Price | Rating |
---|---|---|---|
strawberry | pink | 3.55 | 1 |
chocolate | light brown | 4.75 | 4 |
chocolate | dark brown | 5.25 | 3 |
strawberry | pink | 5.25 | 2 |
chocolate | dark brown | 5.25 | 5 |
bubblegum | pink | 4.75 | 1 |
Exercise: Use the group function to determine the number of cones with each flavor.
cones.group('Flavor')
Flavor | count |
---|---|
bubblegum | 1 |
chocolate | 3 |
strawberry | 2 |
Exercise: Use the group function to compute the average price of cones for each flavor.
cones.group('Flavor', np.average)
Flavor | Color average | Price average | Rating average |
---|---|---|---|
bubblegum | | 4.75 | 1 |
chocolate | | 5.08333 | 4 |
strawberry | | 4.4 | 1.5 |
Exercise: Use the group function to compute the minimum price of cones for each flavor.
cones.group('Flavor', np.min)
Flavor | Color amin | Price amin | Rating amin |
---|---|---|---|
bubblegum | | 4.75 | 1 |
chocolate | | 4.75 | 3 |
strawberry | | 3.55 | 1 |
What is really going on:
cones
Flavor | Color | Price | Rating |
---|---|---|---|
strawberry | pink | 3.55 | 1 |
chocolate | light brown | 4.75 | 4 |
chocolate | dark brown | 5.25 | 3 |
strawberry | pink | 5.25 | 2 |
chocolate | dark brown | 5.25 | 5 |
bubblegum | pink | 4.75 | 1 |
def my_grp(grp):
    print(grp)
    return grp
cones.group("Flavor", my_grp)
['pink']
['light brown' 'dark brown' 'dark brown']
['pink' 'pink']
[ 4.75]
[ 4.75 5.25 5.25]
[ 3.55 5.25]
[1]
[4 3 5]
[1 2]
Flavor | Color my_grp | Price my_grp | Rating my_grp |
---|---|---|---|
bubblegum | ['pink'] | [ 4.75] | [1] |
chocolate | ['light brown' 'dark brown' 'dark brown'] | [ 4.75 5.25 5.25] | [4 3 5] |
strawberry | ['pink' 'pink'] | [ 3.55 5.25] | [1 2] |
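The output above suggests that group collects each column's values per group into an array and then passes that array to the function. A plain-Python sketch of that mechanism for the Price column follows. The helper `group_prices` is hypothetical and is not how the datascience library is implemented.

```python
from collections import defaultdict
from statistics import mean

# The six rows of the cones table: (Flavor, Color, Price, Rating).
cones = [
    ("strawberry", "pink", 3.55, 1),
    ("chocolate", "light brown", 4.75, 4),
    ("chocolate", "dark brown", 5.25, 3),
    ("strawberry", "pink", 5.25, 2),
    ("chocolate", "dark brown", 5.25, 5),
    ("bubblegum", "pink", 4.75, 1),
]

def group_prices(rows, collect):
    """Gather the prices for each flavor, then apply `collect` to each list."""
    by_flavor = defaultdict(list)
    for flavor, _color, price, _rating in rows:
        by_flavor[flavor].append(price)
    # Sort by flavor so the result is ordered like the tables above.
    return {flavor: collect(prices) for flavor, prices in sorted(by_flavor.items())}

counts = group_prices(cones, len)
average_prices = group_prices(cones, mean)
min_prices = group_prices(cones, min)

print(counts)
print(average_prices)
print(min_prices)
```

Swapping `len`, `mean`, or `min` for the collection function changes only the last step, which is why the same group call can produce counts, averages, or minimums.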