from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In this lecture we will:
Here we simulate the Monty Hall problem. We break the process into three steps.
prizes = make_array("goat", "goat", "car")
N = 10_000
outcomes = Table().with_column("My Choice", np.random.choice(prizes, N))
outcomes
| My Choice |
|---|
| goat |
| car |
| goat |
| goat |
| goat |
| car |
| goat |
| goat |
| goat |
| goat |
... (9990 rows omitted)
outcomes = outcomes.with_column("Monty's Door", "goat")
outcomes
| My Choice | Monty's Door |
|---|---|
| goat | goat |
| car | goat |
| goat | goat |
| goat | goat |
| goat | goat |
| car | goat |
| goat | goat |
| goat | goat |
| goat | goat |
| goat | goat |
... (9990 rows omitted)
def other_door(my_choice):
if my_choice == "car":
return "goat"
else:
return "car"
outcomes = outcomes.with_column(
"Other Door", outcomes.apply(other_door, "My Choice"))
outcomes
| My Choice | Monty's Door | Other Door |
|---|---|---|
| goat | goat | car |
| car | goat | goat |
| goat | goat | car |
| goat | goat | car |
| goat | goat | car |
| car | goat | goat |
| goat | goat | car |
| goat | goat | car |
| goat | goat | car |
| goat | goat | car |
... (9990 rows omitted)
Notice that in the above table each row has two goats and a car. Each row simulates an outcome of playing the game.
If we stayed with our initial choice how often would we get a car?
outcomes.group("My Choice").barh("My Choice")
If we switched to the Other door how often would we win?
outcomes.group("Other Door").barh("Other Door")
Would you switch?
Here we will use a dataset of all United airlines flights from 6/1/15 to 8/9/15. This data contains their destination and how long they were delayed, in minutes.
united = Table.read_table('united.csv')
united = ( # Adding row numbers so we can see samples more easily
united
.with_column('Row', np.arange(united.num_rows))
.move_to_start('Row')
)
united
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 0 | 6/1/15 | 73 | HNL | 257 |
| 1 | 6/1/15 | 217 | EWR | 28 |
| 2 | 6/1/15 | 237 | STL | -3 |
| 3 | 6/1/15 | 250 | SAN | 0 |
| 4 | 6/1/15 | 267 | PHL | 64 |
| 5 | 6/1/15 | 273 | SEA | -6 |
| 6 | 6/1/15 | 278 | SEA | -8 |
| 7 | 6/1/15 | 292 | EWR | 12 |
| 8 | 6/1/15 | 300 | HNL | 20 |
| 9 | 6/1/15 | 317 | IND | -10 |
... (13815 rows omitted)
For each of the following, is this a deterministic or random sampling strategy?
united.where('Destination', 'JFK')
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 26 | 6/1/15 | 502 | JFK | -4 |
| 33 | 6/1/15 | 637 | JFK | 141 |
| 39 | 6/1/15 | 704 | JFK | -8 |
| 50 | 6/1/15 | 758 | JFK | -5 |
| 51 | 6/1/15 | 760 | JFK | 352 |
| 56 | 6/1/15 | 824 | JFK | 3 |
| 57 | 6/1/15 | 898 | JFK | 290 |
| 179 | 6/2/15 | 502 | JFK | 0 |
| 188 | 6/2/15 | 637 | JFK | 202 |
| 194 | 6/2/15 | 704 | JFK | -11 |
... (593 rows omitted)
Deterministic
united.sample(3, with_replacement=True)
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 7063 | 7/18/15 | 329 | DEN | -5 |
| 2302 | 6/16/15 | 759 | EWR | 90 |
| 5593 | 7/8/15 | 1454 | DEN | -7 |
Random
(
united
.where('Destination', 'JFK')
.sample(3, with_replacement=True)
)
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 2603 | 6/18/15 | 637 | JFK | 18 |
| 8291 | 7/26/15 | 637 | JFK | 61 |
| 3336 | 6/23/15 | 502 | JFK | 40 |
Random
start = np.random.choice(np.arange(1000))
systematic_sample = united.take(np.arange(start, united.num_rows, 5000))
systematic_sample.show()
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 70 | 6/1/15 | 1124 | LAS | -1 |
| 5070 | 7/5/15 | 599 | BOS | -7 |
| 10070 | 8/6/15 | 1641 | IAD | -2 |
Random
die = Table().with_column('Face', np.arange(1, 7))
die
| Face |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
What is the Probability Distribution of drawing each face assuming each face is equally likely (a "fair die")?
roll_bins = np.arange(0.5, 6.6, 1)
die.hist(bins=roll_bins)
We can sample from the die table many times with replacement:
die.sample(3)
| Face |
|---|
| 4 |
| 5 |
| 6 |
We can construct an Empirical Distribution from our simulation:
die.sample(10).hist(bins=roll_bins)
If we increase the number of trials in our simulation, what happens to the distribution?
die.sample(100).hist(bins=roll_bins)
die.sample(100_000).hist(bins=roll_bins)
The United flight delays is a relatively large dataset:
united.num_rows
13825
We can plot the distribution of delays for the population:
united.hist('Delay', bins = 50)
There appears to be some very delayed flights!
united.sort('Delay', descending=True)
| Row | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 3140 | 6/21/15 | 1964 | SEA | 580 |
| 3154 | 6/22/15 | 300 | HNL | 537 |
| 3069 | 6/21/15 | 1149 | IAD | 508 |
| 2888 | 6/20/15 | 353 | ORD | 505 |
| 12627 | 8/23/15 | 1589 | ORD | 458 |
| 7949 | 7/23/15 | 1960 | LAX | 438 |
| 3412 | 6/23/15 | 1606 | ORD | 430 |
| 578 | 6/4/15 | 1743 | LAX | 408 |
| 2474 | 6/17/15 | 1122 | HNL | 405 |
| 8426 | 7/27/15 | 572 | ORD | 385 |
... (13815 rows omitted)
Let's truncate the extreme flights with a histogram from -20 to 201. (More on why we do this later.)
united_bins = np.arange(-20, 201, 5)
united.hist('Delay', bins = united_bins)
What happens if we take a small sample from this population of flights and compute the distribution of delays:
united.sample(10).hist('Delay', bins = united_bins)
If we increase the sample size
united.sample(1000).hist('Delay', bins = united_bins)
united.sample(2000).hist('Delay', bins = united_bins)
Because we have access to the population (this is rare!) we can compute the parameters directly from the data. For example, supposed we wanted to know the median flight delay:
np.median(united.column('Delay'))
2.0
In practice, we will often have a sample. The median of the sample is a statistic that estimates the median of the population.
np.median(united.sample(10).column('Delay'))
0.0
But is it a good estimate?
It depends on the sample size (and how close we want it to be). Here we define a function to simulate the process of computing the median from a random sample of a given size:
def sample_median(size):
return np.median(united.sample(size).column('Delay'))
sample_median(10)
-0.5
We can then simulate this sampling process many times:
sample_medians = make_array()
for i in np.arange(1000):
new_median = sample_median(10)
sample_medians = np.append(sample_medians, new_median)
medians = Table().with_columns(
"Sample Medians", sample_medians,
"Sample Size", 10)
medians.hist("Sample Medians", bins = 50)
sample_medians2 = make_array()
for i in np.arange(1000):
new_median = sample_median(1000)
sample_medians2 = np.append(sample_medians2, new_median)
medians.append(Table().with_columns(
"Sample Medians", sample_medians2,
"Sample Size", 1000)).hist("Sample Medians", group="Sample Size", bins=50)