from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
Suppose we wanted to manually compute the 55th percentile of the following array:
x = make_array(43, 20, 51, 7, 28, 34)
Step 1. To compute percentiles, we first sort the data.
sorted_x = np.sort(x)
sorted_x
array([ 7, 20, 28, 34, 43, 51])
ptbl = Table().with_columns(
    "Percentile", 100*(np.arange(0, len(x))+1)/len(x),
    "Element", sorted_x)
ptbl
Percentile | Element |
---|---|
16.6667 | 7 |
33.3333 | 20 |
50 | 28 |
66.6667 | 34 |
83.3333 | 43 |
100 | 51 |
Step 2. Figure out where the $p^\text{th}$ percentile would be.
p = 55
ind = int(np.ceil(len(x) * p/100) - 1)  # 1-indexed position, shifted to a 0-indexed array index
ind
3
sorted_x.item(ind)
34
The above calculation is confusing and brittle (try p=0). Instead, we should use the percentile function.
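Concretely, here is what goes wrong: at p = 0 the index formula yields -1, which Python wraps around to the last element, so we would get the maximum instead of the minimum:
p = 0
ind = int(np.ceil(len(x) * p/100) - 1)  # ceil(0) - 1 == -1
ind, sorted_x[ind]  # (-1, 51): the largest element, not the smallest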
percentile?
Signature: percentile(p, arr=None)
Docstring:
Returns the pth percentile of the input array (the value that is at
least as great as p% of the values in the array).

If arr is not provided, percentile returns itself curried with p

>>> percentile(74.9, [1, 3, 5, 9])
5
>>> percentile(75, [1, 3, 5, 9])
5
>>> percentile(75.1, [1, 3, 5, 9])
9
>>> f = percentile(75)
>>> f([1, 3, 5, 9])
5
File:      /opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/datascience/util.py
Type:      function
Recall the percentile table.
ptbl
Percentile | Element |
---|---|
16.6667 | 7 |
33.3333 | 20 |
50 | 28 |
66.6667 | 34 |
83.3333 | 43 |
100 | 51 |
Let's try a few values.
percentile(50, x)
28
percentile(55, x)
34
percentile(0, x)
7
percentile(100, x)
51
s = make_array(1, 3, 5, 7, 9)
Table().with_columns(
    "Percentile", 100*(np.arange(0, len(s))+1)/len(s),
    "Element", sorted(s))
Percentile | Element |
---|---|
20 | 1 |
40 | 3 |
60 | 5 |
80 | 7 |
100 | 9 |
percentile(10, s) == 0
False
percentile(39, s) == percentile(40, s)
True
percentile(40, s) == percentile(41, s)
False
percentile(50, s) == 5
True
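Why these answers? Consistent with the docstring examples above, percentile returns the element at position ceil(p/100 * n) of the sorted array, counting from 1. A quick check of that arithmetic for the values of p we just tried (a sketch of our own, not part of the original analysis):
# Position arithmetic behind the answers above (1-indexed into sorted s)
for p in [10, 39, 40, 41, 50]:
    pos = int(np.ceil(len(s) * p / 100))
    print(p, "->", pos, "->", np.sort(s)[pos - 1])
For example, p=39 and p=40 both land on position 2 (the element 3), while p=41 moves to position 3 (the element 5).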
To demonstrate the process of estimating a parameter, let's examine the 2019 San Francisco public records. We obtained this data from the SF Open Data Portal. For the purposes of this exercise, we will assume that this is a census of the compensation data: that it contains the compensation for all public workers.
sf = Table.read_table('san_francisco_2019.csv')
sf.show(3)
Organization Group | Department | Job Family | Job | Salary | Overtime | Benefits | Total Compensation |
---|---|---|---|---|---|---|---|
Public Protection | Adult Probation | Information Systems | IS Trainer-Journey | 91332 | 0 | 40059 | 131391 |
Public Protection | Adult Probation | Information Systems | IS Engineer-Assistant | 123241 | 0 | 49279 | 172520 |
Public Protection | Adult Probation | Information Systems | IS Business Analyst-Senior | 115715 | 0 | 46752 | 162468 |
... (44522 rows omitted)
Suppose we are interested in studying "Total Compensation". Let's make a histogram of the total compensation.
sf.hist("Total Compensation", bins=100)
Who is getting paid the most?
# Who made the most money
sf.sort('Total Compensation', descending=True).show(5)
Organization Group | Department | Job Family | Job | Salary | Overtime | Benefits | Total Compensation |
---|---|---|---|---|---|---|---|
General Administration & Finance | Retirement Services | Administrative & Mgmt (Unrep) | Chief Investment Officer | 577633 | 0 | 146398 | 724031 |
General Administration & Finance | Retirement Services | Unassigned | Managing Director | 483072 | 0 | 134879 | 617951 |
General Administration & Finance | Retirement Services | Unassigned | Managing Director | 482649 | 0 | 134905 | 617554 |
General Administration & Finance | Retirement Services | Unassigned | Managing Director | 451507 | 0 | 120276 | 571784 |
General Administration & Finance | Retirement Services | Unassigned | Managing Director | 449378 | 0 | 120857 | 570235 |
... (44520 rows omitted)
Who is getting paid the least?
sf.sort('Total Compensation').show(10)
Organization Group | Department | Job Family | Job | Salary | Overtime | Benefits | Total Compensation |
---|---|---|---|---|---|---|---|
Public Protection | Adult Probation | Probation & Parole | Deputy Probation Officer | 0 | 0 | 0 | 0 |
Public Protection | Fire Department | Clerical, Secretarial & Steno | Senior Clerk Typist | 0 | 0 | 0 | 0 |
Public Protection | Juvenile Court | Correction & Detention | Counselor, Juvenile Hall PERS | 0 | 0 | 0 | 0 |
Public Protection | Police | Clerical, Secretarial & Steno | Clerk Typist | 0 | 0 | 0 | 0 |
Public Protection | Sheriff | Correction & Detention | Deputy Sheriff | 0 | 0 | 0 | 0 |
Public Works, Transportation & Commerce | Airport Commission | Sub-Professional Engineering | StdntDsgn Train2/Arch/Eng/Plng | 0 | 0 | 0 | 0 |
Public Works, Transportation & Commerce | Airport Commission | Clerical, Secretarial & Steno | Executive Secretary 1 | 0 | 0 | 0 | 0 |
Public Works, Transportation & Commerce | Airport Commission | Payroll, Billing & Accounting | Senior Account Clerk | 0 | 0 | 0 | 0 |
Public Works, Transportation & Commerce | Airport Commission | Housekeeping & Laundry | Custodian | 0 | 0 | 0 | 0 |
Public Works, Transportation & Commerce | Airport Commission | Housekeeping & Laundry | Custodian | 0 | 0 | 0 | 0 |
... (44515 rows omitted)
There is a clear spike around zero! Why?
We will focus on those who worked at least 20 hours per week at minimum wage for an entire year.
min_salary = 15 * 20 * 50 # $15/hr, 20 hr/wk, 50 weeks
print("Min Salary", min_salary)
sf = sf.where('Salary', are.above(min_salary))
Min Salary 15000
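As a quick sanity check (output not shown here), we can count how many workers remain after the filter:
# Number of workers remaining after dropping sub-minimum records
sf.num_rows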
salary_bins = np.arange(min_salary, 500000, 10000)
sf.hist("Total Compensation", bins=salary_bins)
Here we have access to the population so we can compute parameters directly.
For example, suppose we were interested in the median compensation. Then we could compute it directly on our data:
pop_median = percentile(50, sf.column("Total Compensation"))
pop_median
135747.0
In most real-world settings, you won't have access to the population. Instead, you will take a random sample.
Suppose we sample 400 people from our population.
# An Empirical Distribution
our_sample = sf.sample(400, with_replacement=False)
our_sample.hist('Total Compensation', bins=salary_bins)
We can use the sample median (statistic) as an estimate of the parameter value.
# Estimate: Median of a Sample
percentile(50, our_sample.column('Total Compensation'))
134729.0
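Since we happen to know the population median here, we can also peek at how far off this one estimate is (something we could never do in practice):
# Error of this single estimate relative to the known parameter
percentile(50, our_sample.column('Total Compensation')) - pop_median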
But in the real world we won't be able to keep going back to the population, so how good is this single estimate? If we could get additional samples from the population, how much variability would there be in our estimate of the median?
def generate_sample_median(samp_size):
    new_sample = sf.sample(samp_size, with_replacement=False)
    return percentile(50, new_sample.column('Total Compensation'))
generate_sample_median(400)
133096.0
Because we have access to the population, we can simulate many samples from the population:
sample_medians = make_array()
for i in np.arange(1000):
    new_median = generate_sample_median(400)
    sample_medians = np.append(sample_medians, new_median)
med_bins = np.arange(120000, 160000, 1000)
Table().with_column('Sample Medians', sample_medians).hist(bins=med_bins)
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');
What happens if we do the same thing again with slightly larger samples?
sample_medians2 = make_array()
for i in np.arange(1000):
    new_median = generate_sample_median(800)
    sample_medians2 = np.append(sample_medians2, new_median)
(Table()
 .with_columns("Sample Medians", sample_medians,
               "Sample Size", 400)
 .append(Table().with_columns("Sample Medians", sample_medians2,
                              "Sample Size", 800))
 .hist("Sample Medians", group="Sample Size", bins=med_bins)
)
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');
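To quantify what the overlaid histograms show, we can compare the spread of the two collections of simulated medians, for instance with their standard deviations:
# Larger samples should yield less variable estimates of the median
np.std(sample_medians), np.std(sample_medians2)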
But in the real world we won't be able to keep going back to the population. How do we generate a new random sample without going back to the population? The bootstrap: sample randomly from the sample itself.
Step 1: Sample the original sample with replacement, the same number of times as the original sample size.
tbl.sample() # All you need!
The default behavior of tbl.sample does exactly this:
bootstrap_sample = our_sample.sample()
print("Number of Rows:", bootstrap_sample.num_rows)
Number of Rows: 400
bootstrap_sample.hist('Total Compensation', bins=salary_bins)
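Because the draw is with replacement, some rows of our_sample appear multiple times and others not at all; on average only about 63% of the original rows show up. Assuming most compensation values are distinct, a rough check is to count the unique values:
# Roughly 63% of the 400 original rows appear in a bootstrap sample
len(np.unique(bootstrap_sample.column('Total Compensation')))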
Step 2: Compute statistic on bootstrap sample.
percentile(50, bootstrap_sample.column('Total Compensation'))
139711.0
Repeat the sampling process many times:
def one_bootstrap_median():
    # Draw the bootstrap sample: resample our_sample with replacement
    bootstrap_sample = our_sample.sample()
    # Return the median total compensation in the bootstrap sample
    return percentile(50, bootstrap_sample.column('Total Compensation'))
one_bootstrap_median()
135417.0
# Generate the medians of 1000 bootstrap samples
num_repetitions = 1000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())
Examine the empirical distribution of the bootstrap medians.
resampled_medians = Table().with_column('Bootstrap Sample Median', bstrap_medians)
median_bins=np.arange(120000, 160000, 2000)
resampled_medians.hist(bins = median_bins)
# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');
The following function implements the general bootstrap procedure.
def bootstrapper(sample, statistic, num_repetitions):
    """
    Returns the statistic computed on each of num_repetitions
    bootstrap samples drawn from sample.
    """
    bstrap_stats = make_array()
    for i in np.arange(num_repetitions):
        # Step 1: Sample the sample (with replacement, same size)
        bootstrap_sample = sample.sample()
        # Step 2: Compute the statistic on the sample of the sample
        bootstrap_stat = statistic(bootstrap_sample)
        # Accumulate the statistics
        bstrap_stats = np.append(bstrap_stats, bootstrap_stat)
    return bstrap_stats
og_sample = sf.sample(400, with_replacement=False)  # a fresh sample from the population
def compute_median(sample):
    return percentile(50, sample.column("Total Compensation"))
bootstrap_medians = bootstrapper(og_sample, compute_median, 1000)
(Table().with_column("bootstraps", bootstrap_medians)
 .hist(bins=median_bins))
## Extra Viz code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');
Computing confidence intervals is as simple as computing percentiles of the bootstrap statistics. No magic equations!
left = percentile(2.5, bstrap_medians)
right = percentile(97.5, bstrap_medians)
make_array(left, right)
array([ 128774., 142258.])
resampled_medians.hist(bins = median_bins)
# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);
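As a final sketch, the whole interval computation can be packaged on top of bootstrapper. The helper below (the name bootstrap_ci is ours, not part of the datascience library) returns a percentile interval at any confidence level:
def bootstrap_ci(sample, statistic, num_repetitions=1000, level=95):
    """Percentile bootstrap confidence interval for the statistic."""
    bstrap_stats = bootstrapper(sample, statistic, num_repetitions)
    tail = (100 - level) / 2
    return make_array(percentile(tail, bstrap_stats),
                      percentile(100 - tail, bstrap_stats))

bootstrap_ci(our_sample, compute_median)  # a 95% interval, like the one above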