In [1]:

```
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
```

In this lecture we will:

- Simulate the Monty Hall problem
- Demonstrate deterministic and random sampling
- Compare probability distributions and empirical distributions
- Illustrate the Law of Large Numbers

Here we simulate the Monty Hall problem. We break the process into three steps.

- Simulate the prize behind the door we picked (this is the only chance event):

In [2]:

```
prizes = make_array("goat", "goat", "car")
```

In [3]:

```
N = 10_000
outcomes = Table().with_column("My Choice", np.random.choice(prizes, N))
outcomes
```

Out[3]:

My Choice |
---|
goat |
car |
goat |
goat |
goat |
car |
goat |
goat |
goat |
goat |

... (9990 rows omitted)

- Then Monty reveals a goat behind one of the other doors. He can always do this (at least one of the other doors hides a goat), so we record it directly:

In [4]:

```
outcomes = outcomes.with_column("Monty's Door", "goat")
outcomes
```

Out[4]:

My Choice | Monty's Door |
---|---|
goat | goat |
car | goat |
goat | goat |
goat | goat |
goat | goat |
car | goat |
goat | goat |
goat | goat |
goat | goat |
goat | goat |

... (9990 rows omitted)

- Finally, determine what is behind the other (unopened) door: it hides the car exactly when our first pick hides a goat.

In [5]:

```
def other_door(my_choice):
    if my_choice == "car":
        return "goat"
    else:
        return "car"
```

In [6]:

```
outcomes = outcomes.with_column(
    "Other Door", outcomes.apply(other_door, "My Choice"))
outcomes
```

Out[6]:

My Choice | Monty's Door | Other Door |
---|---|---|
goat | goat | car |
car | goat | goat |
goat | goat | car |
goat | goat | car |
goat | goat | car |
car | goat | goat |
goat | goat | car |
goat | goat | car |
goat | goat | car |
goat | goat | car |

... (9990 rows omitted)

If we stayed with our initial choice, how often would we win the car?

In [7]:

```
outcomes.group("My Choice").barh("My Choice")
```

If we switched to the other door, how often would we win?

In [8]:

```
outcomes.group("Other Door").barh("Other Door")
```

Would you switch?
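The whole game can also be checked with a NumPy-only sketch (no `datascience` tables needed; the seed and trial count are illustrative). Staying wins exactly when our first pick hid the car, and switching wins exactly when it hid a goat:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# The only chance event: the prize behind the door we initially pick.
my_choice = rng.choice(["goat", "goat", "car"], N)

# Staying wins when our first pick hid the car (probability 1/3);
# switching wins when it hid a goat (probability 2/3).
stay_wins = np.mean(my_choice == "car")
switch_wins = np.mean(my_choice != "car")
print(stay_wins, switch_wins)
```

The two proportions come out near 1/3 and 2/3, matching the bar charts above: switching doubles the chance of winning.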

In [9]:

```
united = Table.read_table('united.csv')
united = ( # Adding row numbers so we can see samples more easily
    united
    .with_column('Row', np.arange(united.num_rows))
    .move_to_start('Row')
)
united
```

Out[9]:

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
0 | 6/1/15 | 73 | HNL | 257 |
1 | 6/1/15 | 217 | EWR | 28 |
2 | 6/1/15 | 237 | STL | -3 |
3 | 6/1/15 | 250 | SAN | 0 |
4 | 6/1/15 | 267 | PHL | 64 |
5 | 6/1/15 | 273 | SEA | -6 |
6 | 6/1/15 | 278 | SEA | -8 |
7 | 6/1/15 | 292 | EWR | 12 |
8 | 6/1/15 | 300 | HNL | 20 |
9 | 6/1/15 | 317 | IND | -10 |

... (13815 rows omitted)

For each of the following, is this a deterministic or random sampling strategy?

In [10]:

```
united.where('Destination', 'JFK')
```

Out[10]:

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
26 | 6/1/15 | 502 | JFK | -4 |
33 | 6/1/15 | 637 | JFK | 141 |
39 | 6/1/15 | 704 | JFK | -8 |
50 | 6/1/15 | 758 | JFK | -5 |
51 | 6/1/15 | 760 | JFK | 352 |
56 | 6/1/15 | 824 | JFK | 3 |
57 | 6/1/15 | 898 | JFK | 290 |
179 | 6/2/15 | 502 | JFK | 0 |
188 | 6/2/15 | 637 | JFK | 202 |
194 | 6/2/15 | 704 | JFK | -11 |

... (593 rows omitted)

**Deterministic**

In [11]:

```
united.sample(3, with_replacement=True)
```

Out[11]:

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
7063 | 7/18/15 | 329 | DEN | -5 |
2302 | 6/16/15 | 759 | EWR | 90 |
5593 | 7/8/15 | 1454 | DEN | -7 |

**Random**

In [12]:

```
(
    united
    .where('Destination', 'JFK')
    .sample(3, with_replacement=True)
)
```

Out[12]:

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
2603 | 6/18/15 | 637 | JFK | 18 |
8291 | 7/26/15 | 637 | JFK | 61 |
3336 | 6/23/15 | 502 | JFK | 40 |

**Random**

In [13]:

```
start = np.random.choice(np.arange(1000))
systematic_sample = united.take(np.arange(start, united.num_rows, 5000))
systematic_sample.show()
```

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
70 | 6/1/15 | 1124 | LAS | -1 |
5070 | 7/5/15 | 599 | BOS | -7 |
10070 | 8/6/15 | 1641 | IAD | -2 |

**Random**
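This systematic sample is random because of a single chance event: the starting row. Once the start is drawn, every other row in the sample is determined. A NumPy-only sketch of the same idea, with the row count and step matching the cell above:

```python
import numpy as np

rng = np.random.default_rng(0)
num_rows = 13825   # number of rows in the united table
step = 5000

# One random draw picks the start; every row after that is determined.
start = rng.choice(np.arange(1000))
rows = np.arange(start, num_rows, step)
print(rows)
```

Any start below 1000 yields exactly three rows, each 5000 apart, just as in the output above.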

In [14]:

```
die = Table().with_column('Face', np.arange(1, 7))
die
```

Out[14]:

Face |
---|
1 |
2 |
3 |
4 |
5 |
6 |

What is the **Probability Distribution** of drawing each face, assuming each face is equally likely (a "fair die")?

In [15]:

```
roll_bins = np.arange(0.5, 6.6, 1)
die.hist(bins=roll_bins)
```
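The histogram is flat because a fair die assigns probability 1/6 to every face. As a minimal sketch of that probability distribution in plain NumPy:

```python
import numpy as np

faces = np.arange(1, 7)
# A fair die: every face gets the same probability, and they sum to 1.
probabilities = np.full(6, 1 / 6)
print(dict(zip(faces, probabilities)))
```

Unlike the empirical distributions below, this distribution is a fixed mathematical object: it does not change from simulation to simulation.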

We can sample from the die table many times with replacement:

In [16]:

```
die.sample(3)
```

Out[16]:

Face |
---|
4 |
5 |
6 |

We can construct an **Empirical Distribution** from our simulation:

In [17]:

```
die.sample(10).hist(bins=roll_bins)
```

If we increase the number of trials in our simulation, what happens to the distribution?

In [18]:

```
die.sample(100).hist(bins=roll_bins)
```

In [19]:

```
die.sample(100_000).hist(bins=roll_bins)
```
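This is the Law of Large Numbers at work: as the number of rolls grows, the empirical proportion of each face approaches its probability of 1/6. A NumPy-only sketch (the seed and roll counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def empirical_proportions(n):
    # Roll a fair die n times; return the proportion of each face 1..6.
    rolls = rng.integers(1, 7, size=n)
    return np.bincount(rolls, minlength=7)[1:] / n

for n in [10, 100, 100_000]:
    worst = np.abs(empirical_proportions(n) - 1 / 6).max()
    # The worst deviation from 1/6 tends to shrink as n grows.
    print(n, worst)
```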

The United flight delays table is relatively large:

In [20]:

```
united.num_rows
```

Out[20]:

13825

We can plot the distribution of delays for the population:

In [21]:

```
united.hist('Delay', bins = 50)
```

There appear to be some very delayed flights!

In [22]:

```
united.sort('Delay', descending=True)
```

Out[22]:

Row | Date | Flight Number | Destination | Delay |
---|---|---|---|---|
3140 | 6/21/15 | 1964 | SEA | 580 |
3154 | 6/22/15 | 300 | HNL | 537 |
3069 | 6/21/15 | 1149 | IAD | 508 |
2888 | 6/20/15 | 353 | ORD | 505 |
12627 | 8/23/15 | 1589 | ORD | 458 |
7949 | 7/23/15 | 1960 | LAX | 438 |
3412 | 6/23/15 | 1606 | ORD | 430 |
578 | 6/4/15 | 1743 | LAX | 408 |
2474 | 6/17/15 | 1122 | HNL | 405 |
8426 | 7/27/15 | 572 | ORD | 385 |

... (13815 rows omitted)

Let's truncate the extreme flights with a histogram from -20 to 201. (More on why we do this later.)

In [23]:

```
united_bins = np.arange(-20, 201, 5)
united.hist('Delay', bins = united_bins)
```

In [24]:

```
united.sample(10).hist('Delay', bins = united_bins)
```

If we increase the sample size, the empirical distribution more closely resembles the population distribution:

In [25]:

```
united.sample(1000).hist('Delay', bins = united_bins)
```

In [26]:

```
united.sample(2000).hist('Delay', bins = united_bins)
```

In [27]:

```
np.median(united.column('Delay'))
```

Out[27]:

2.0

In [28]:

```
np.median(united.sample(10).column('Delay'))
```

Out[28]:

0.0

But is it a good estimate? That depends on how much the sample median varies from sample to sample, so let's define a function that computes the median of a fresh random sample:

In [29]:

```
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))
```

In [30]:

```
sample_median(10)
```

Out[30]:

-0.5

We can then simulate this sampling process many times:

In [31]:

```
sample_medians = make_array()
for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)
```

In [32]:

```
medians = Table().with_columns(
    "Sample Medians", sample_medians,
    "Sample Size", 10)
medians.hist("Sample Medians", bins = 50)
```

In [33]:

```
sample_medians2 = make_array()
for i in np.arange(1000):
    new_median = sample_median(1000)
    sample_medians2 = np.append(sample_medians2, new_median)
```

In [34]:

```
medians.append(Table().with_columns(
    "Sample Medians", sample_medians2,
    "Sample Size", 1000)).hist("Sample Medians", group="Sample Size", bins=50)
```
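The comparison above can also be sketched with plain NumPy on a synthetic stand-in population (the lecture uses the real `Delay` column; the right-skewed exponential shape here is only illustrative). Medians of samples of size 1000 cluster much more tightly around the population median than medians of samples of size 10:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic right-skewed stand-in for the delay population (illustrative).
population = rng.exponential(scale=20, size=13825) - 6

def sample_median(size):
    # Median of one random sample drawn with replacement.
    return np.median(rng.choice(population, size))

for size in [10, 1000]:
    medians = np.array([sample_median(size) for _ in range(1000)])
    # The spread of the sample medians shrinks as the sample size grows.
    print(size, round(medians.std(), 2))
```

This is the general pattern behind the two histograms above: larger samples give estimates that vary less, so the sample median from a large sample is a more reliable estimate of the population median.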