from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
Recall that in this example we actually have access to the population data. This is not normally the case: the methods we are about to study are designed for situations where you only have a sample, not the full population.
sf = Table.read_table('san_francisco_2019.csv')
# Keep only employees paid at least the equivalent of $15/hour,
# 20 hours/week, for 50 weeks
min_salary = 15 * 20 * 50
sf = sf.where('Salary', are.above(min_salary))
sf.num_rows
37103
sf_bins = np.arange(0, 726000, 25000)
sf.hist('Total Compensation', bins=sf_bins)
Here we are interested in the median (50th percentile) of the total compensation.
# Parameter: Median total compensation in the population
def median_comp(t):
    return percentile(50, t.column('Total Compensation'))
Because we have access to the population, we can compute the parameter directly. In practice we typically won't have the population, and we will have to estimate the parameter from a sample.
pop_median = median_comp(sf)
print("Parameter Value:", pop_median)
Parameter Value: 135747.0
In practice, the data we would get from most studies would be a sample of the entire population. This is because collecting a full census is challenging and expensive.
Here we load the (fictional) sample that was taken by Prof. Gonzalez.
He collected this sample by selecting 400 names at random from the HR database and then directly asking them how much they make. It took all weekend and he had to give a chocolate to everyone who participated.
Actually, everything above is a complete fabrication. Prof. Gonzalez simply wrote:
sf.sample(400, with_replacement=False).to_csv("sf_sample.csv")
sample_sf = Table.read_table("sf_sample.csv")
sample_sf.show(4)
Organization Group | Department | Job Family | Job | Salary | Overtime | Benefits | Total Compensation |
---|---|---|---|---|---|---|---|
Community Health | Public Health | Human Services | Hospital Eligibility Worker | 75854 | 2661 | 35780 | 114296 |
Community Health | Public Health | Nursing | Nurse Practitioner | 236867 | 0 | 77694 | 314562 |
Public Protection | Police | Police Services | Police Officer | 124906 | 12346 | 41706 | 178957 |
Public Works, Transportation & Commerce | Department Of Public Works | Skilled Labor | Hodcarrier | 88812 | 589 | 40995 | 130396 |
... (396 rows omitted)
sample_sf.num_rows
400
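Before bootstrapping, we can also compute the statistic directly on the sample as a quick sketch. This point estimate will typically be close to, but not exactly equal to, the population parameter we computed above.
# The sample median: a point estimate of the population median
median_comp(sample_sf)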
We introduced the following function, which computes the statistic on repeated bootstrap samples of the original sample.
def bootstrapper(sample, statistic, num_repetitions):
    """
    Returns the statistic computed on num_repetitions
    bootstrap samples from sample.
    """
    bstrap_stats = make_array()
    for i in np.arange(num_repetitions):
        # Step 1: Sample the Sample (same size, with replacement)
        bootstrap_sample = sample.sample()
        # Step 2: Compute the statistic on the sample of the sample
        bootstrap_stat = statistic(bootstrap_sample)
        # Accumulate the statistics
        bstrap_stats = np.append(bstrap_stats, bootstrap_stat)
    return bstrap_stats
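As an aside, the same helper works for any statistic, not just the median. Here is a quick sketch that bootstraps the mean instead; mean_comp is a small helper defined here only for illustration.
# A hypothetical statistic, defined only to show that bootstrapper is generic
def mean_comp(t):
    return np.mean(t.column('Total Compensation'))

bootstrap_means = bootstrapper(sample_sf, mean_comp, 1000)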
bootstrap_medians = bootstrapper(sample_sf, median_comp, 10000)
When using the bootstrap, it is important to always examine the distribution of the statistic to check that it is roughly symmetric and "bell shaped".
Table().with_column("bootstraps", bootstrap_medians).hist()
## Extra Viz code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');
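As a rough numeric complement to the histogram, we could also compare the mean and the median of the bootstrap distribution; for a roughly symmetric distribution the two should be close.
# Rough symmetry check: mean and median of the bootstrap distribution
print("Mean of bootstrap medians:  ", np.mean(bootstrap_medians))
print("Median of bootstrap medians:", percentile(50, bootstrap_medians))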
We compute the confidence interval for our desired *confidence level* using the following code:
def compute_ci(bs_samples, confidence_level):
    """
    Returns the confidence interval for the provided bootstrap samples
    and desired confidence level.
    """
    # Split the remaining probability evenly between the two tails
    tail_size = (100 - confidence_level) / 2
    lower = percentile(tail_size, bs_samples)
    upper = percentile(100 - tail_size, bs_samples)
    return [lower, upper]
ci = compute_ci(bootstrap_medians, 95)
ci
[130580.0, 142054.0]
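The same function works for other confidence levels. As a quick sketch, a 90% interval trims 5% from each tail and therefore comes out a bit narrower than the 95% interval above.
# Narrower interval: 5% is trimmed from each tail instead of 2.5%
compute_ci(bootstrap_medians, 90)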
Visualizing the CI
Table().with_column("bootstraps", bootstrap_medians).hist()
# cool python trick to deconstruct the array!
[left, right] = ci
# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);
Return to Slides
Let's look at what happens if we use the above code repeatedly with separate original samples from the population. How accurate are our bootstrap estimates of the parameter?
intervals = Table(['lower', 'upper', 'median', 'good', 'sample_size'])
sample_sizes = [2, 4, 8, 16]
for sample_size in sample_sizes:
    for trial in np.arange(20): # Run 20 trials of each configuration
        # Pay for one new random sample from the population
        og_sample = sf.sample(sample_size, with_replacement=False)
        # Compute the statistic on the sample
        sample_median = median_comp(og_sample)
        # Generate the medians of 1000 bootstrap samples
        bootstrap_medians = bootstrapper(og_sample, median_comp, 1000)
        # Construct the confidence interval
        [ci_lower, ci_upper] = compute_ci(bootstrap_medians, 95)
        # Determine if the confidence interval is good (contains the parameter)
        is_good = ci_lower <= pop_median <= ci_upper
        # Add a row to the table
        intervals = intervals.with_row(
            [ci_lower, ci_upper, sample_median, is_good, str(sample_size)])
# Add an experiment number
intervals = intervals.with_column("Experiment", np.arange(intervals.num_rows))
intervals
lower | upper | median | good | sample_size | Experiment |
---|---|---|---|---|---|
148779 | 188554 | 148779 | False | 2 | 0 |
210940 | 234990 | 210940 | False | 2 | 1 |
107907 | 180652 | 107907 | True | 2 | 2 |
121077 | 137292 | 121077 | True | 2 | 3 |
25252 | 59568 | 25252 | False | 2 | 4 |
105799 | 113954 | 105799 | False | 2 | 5 |
20421 | 26208 | 20421 | False | 2 | 6 |
105664 | 123580 | 105664 | False | 2 | 7 |
137439 | 147847 | 137439 | False | 2 | 8 |
79425 | 138245 | 79425 | True | 2 | 9 |
... (70 rows omitted)
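One quick way to summarize how accurate the bootstrap estimates are is to compute, for each sample size, the fraction of the 20 intervals that actually contain the parameter; a rough sketch using the "good" column above:
# Fraction of intervals containing pop_median, grouped by sample size
intervals.select('sample_size', 'good').group('sample_size', np.mean)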
Here I render a plot of all the confidence intervals, with the parameter value depicted as a solid vertical line. This is advanced plotting code that is out of scope for this class, but it might be fun to try to understand what it is doing.
import plotly.express as px

# Plotly will draw error bars which are the distance in each direction
# from the median
intervals = intervals.with_columns(
    "error_left", intervals.column("median") - intervals.column("lower"),
    "error_right", intervals.column("upper") - intervals.column("median"))

# Making the plot
fig = px.scatter(
    intervals.to_df(),           # Plotly requires a pandas dataframe
    x="median",                  # X location of my interval center dot
    y="Experiment",              # Y location of my interval center dot
    color="sample_size",         # The column to use for the color
    symbol="good",               # The column to use for the symbol
    symbol_map={True: "circle", False: "circle-open"},
    error_x="error_right",       # Upper error bar size
    error_x_minus="error_left",  # Lower error bar size
    height=800)
fig.add_vline(pop_median)