In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Lecture 23¶

Percentiles¶

Suppose we wanted to manually compute the 55th percentile of the following array:

In [2]:
x = make_array(43, 20, 51, 7, 28, 34)

Step 1. To compute percentiles, we first sort the data.

In [3]:
sorted_x = np.sort(x)
sorted_x
Out[3]:
array([ 7, 20, 28, 34, 43, 51])
In [4]:
ptbl = Table().with_columns(
    "Percentile", 100*(np.arange(0, len(x))+1)/len(x),
    "Element", sorted_x)
ptbl
Out[4]:
Percentile Element
16.6667 7
33.3333 20
50 28
66.6667 34
83.3333 43
100 51

Step 2. Figure out where the $p^\text{th}$ percentile would be. From the table above, we want the smallest element whose percentile rank is at least $p$; in a sorted array of $n$ elements, that is the element at (one-indexed) position $\lceil n \cdot p / 100 \rceil$.

In [5]:
p = 55
ind = int(np.ceil(len(x) * p/100) - 1)
ind
Out[5]:
3
In [6]:
sorted_x.item(ind)
Out[6]:
34

The above calculation is confusing and brittle: with p = 0, the index formula gives -1, and Python's negative indexing silently wraps around to the maximum (see the sketch below). Instead, we should use the percentile function provided by the datascience library.
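A quick sketch of that failure, reusing x and sorted_x from above:

p = 0
ind = int(np.ceil(len(x) * p/100) - 1)   # ceil(0) - 1 == -1
sorted_x[ind]                            # 51, the maximum rather than the minimum 7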

Using the Percentile Function¶

In [7]:
percentile?
Signature: percentile(p, arr=None)
Docstring:
Returns the pth percentile of the input array (the value that is at
least as great as p% of the values in the array).

If arr is not provided, percentile returns itself curried with p

>>> percentile(74.9, [1, 3, 5, 9])
5
>>> percentile(75, [1, 3, 5, 9])
5
>>> percentile(75.1, [1, 3, 5, 9])
9
>>> f = percentile(75)
>>> f([1, 3, 5, 9])
5
File:      /opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/datascience/util.py
Type:      function
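
As the docstring notes, calling percentile with only p returns a curried function, which is convenient wherever a function of a single array is expected. A small sketch using the array x from above:

f = percentile(55)
f(x)   # 34, the same as percentile(55, x)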

Recall the percentile table.

In [8]:
ptbl
Out[8]:
Percentile Element
16.6667 7
33.3333 20
50 28
66.6667 34
83.3333 43
100 51

Let's try a few values.

In [9]:
percentile(50, x)
Out[9]:
28
In [10]:
percentile(55, x)
Out[10]:
34
In [11]:
percentile(0, x)
Out[11]:
7
In [12]:
percentile(100, x)
Out[12]:
51

Return to Slides

Discussion Question¶

In [13]:
s = make_array(1, 3, 5, 7, 9)
In [14]:
Table().with_columns(
    "Percentile", 100*(np.arange(0, len(s))+1)/len(s),
    "Element", sorted(s))
Out[14]:
Percentile Element
20 1
40 3
60 5
80 7
100 9
In [15]:
percentile(10, s) == 0
Out[15]:
False
In [16]:
percentile(39, s) == percentile(40, s)
Out[16]:
True
In [17]:
percentile(40, s) == percentile(41, s)
Out[17]:
False
In [18]:
percentile(50, s) == 5
Out[18]:
True

In each case, percentile(p, s) returns the first element of the sorted array whose percentile rank is at least p, which explains all four answers above.

Return to Slides

Inference: Estimation¶

To demonstrate the process of estimating a parameter, let's examine the 2019 San Francisco public records. We obtained this data from the SF Open Data Portal. For the purposes of this exercise, we will assume that this is a census of the compensation data: that it contains the compensation for all public workers.

In [19]:
sf = Table.read_table('san_francisco_2019.csv')
sf.show(3)
Organization Group Department Job Family Job Salary Overtime Benefits Total Compensation
Public Protection Adult Probation Information Systems IS Trainer-Journey 91332 0 40059 131391
Public Protection Adult Probation Information Systems IS Engineer-Assistant 123241 0 49279 172520
Public Protection Adult Probation Information Systems IS Business Analyst-Senior 115715 0 46752 162468

... (44522 rows omitted)

Suppose we are interested in studying "Total Compensation". Let's make a histogram of the total compensation.

In [20]:
sf.hist("Total Compensation", bins=100)

Who is getting paid the most?

In [21]:
# Who made the most money
sf.sort('Total Compensation', descending=True).show(5)
Organization Group Department Job Family Job Salary Overtime Benefits Total Compensation
General Administration & Finance Retirement Services Administrative & Mgmt (Unrep) Chief Investment Officer 577633 0 146398 724031
General Administration & Finance Retirement Services Unassigned Managing Director 483072 0 134879 617951
General Administration & Finance Retirement Services Unassigned Managing Director 482649 0 134905 617554
General Administration & Finance Retirement Services Unassigned Managing Director 451507 0 120276 571784
General Administration & Finance Retirement Services Unassigned Managing Director 449378 0 120857 570235

... (44520 rows omitted)

Who is getting paid the least?

In [22]:
sf.sort('Total Compensation').show(10)
Organization Group Department Job Family Job Salary Overtime Benefits Total Compensation
Public Protection Adult Probation Probation & Parole Deputy Probation Officer 0 0 0 0
Public Protection Fire Department Clerical, Secretarial & Steno Senior Clerk Typist 0 0 0 0
Public Protection Juvenile Court Correction & Detention Counselor, Juvenile Hall PERS 0 0 0 0
Public Protection Police Clerical, Secretarial & Steno Clerk Typist 0 0 0 0
Public Protection Sheriff Correction & Detention Deputy Sheriff 0 0 0 0
Public Works, Transportation & Commerce Airport Commission Sub-Professional Engineering StdntDsgn Train2/Arch/Eng/Plng 0 0 0 0
Public Works, Transportation & Commerce Airport Commission Clerical, Secretarial & Steno Executive Secretary 1 0 0 0 0
Public Works, Transportation & Commerce Airport Commission Payroll, Billing & Accounting Senior Account Clerk 0 0 0 0
Public Works, Transportation & Commerce Airport Commission Housekeeping & Laundry Custodian 0 0 0 0
Public Works, Transportation & Commerce Airport Commission Housekeeping & Laundry Custodian 0 0 0 0

... (44515 rows omitted)

There is a clear spike around zero! Why?



Many of these records likely correspond to employees who worked only a small fraction of the year, or not at all. We will focus on those who worked at least 20 hours per week at minimum wage for an entire year.

In [23]:
min_salary = 15 * 20 * 50 # $15/hr, 20 hr/wk, 50 weeks
print("Min Salary", min_salary)
sf = sf.where('Salary', are.above(min_salary))
Min Salary 15000
In [24]:
salary_bins = np.arange(min_salary, 500000, 10000)
sf.hist("Total Compensation", bins=salary_bins)

The Population Parameter¶

Here we have access to the population, so we can compute parameters directly.

For example, suppose we were interested in the median compensation. Then we could compute it directly on our data:

In [25]:
pop_median = percentile(50, sf.column("Total Compensation"))
pop_median
Out[25]:
135747.0

In most real-world settings, you won't have access to the population. Instead, you will take a random sample.

Suppose we sample 400 people from our population.

In [26]:
# An Empirical Distribution
our_sample = sf.sample(400, with_replacement=False)
our_sample.hist('Total Compensation', bins=salary_bins)

We can use the sample median (statistic) as an estimate of the parameter value.

In [27]:
# Estimate: Median of a Sample
percentile(50, our_sample.column('Total Compensation'))
Out[27]:
134729.0

But in the real world we won't be able to keep going back to the population. How do we generate a new random sample without going back to the population?

Return to Slides

Variability of the Estimate¶

If we could get additional samples from the population, how much variability would there be in our estimate of the median?

In [28]:
def generate_sample_median(samp_size):
    new_sample = sf.sample(samp_size, with_replacement=False)
    return percentile(50, new_sample.column('Total Compensation'))
In [29]:
generate_sample_median(400)
Out[29]:
133096.0

Quantifying Uncertainty¶

Because we have access to the population, we can simulate many samples from the population:

In [30]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = generate_sample_median(400)
    sample_medians = np.append(sample_medians, new_median)
In [31]:
med_bins = np.arange(120000, 160000, 1000)
Table().with_column('Sample Medians', sample_medians).hist(bins=med_bins)

plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');

What happens if we do the same thing again with slightly larger samples?

In [32]:
sample_medians2 = make_array()

for i in np.arange(1000):
    new_median = generate_sample_median(800)
    sample_medians2 = np.append(sample_medians2, new_median)
In [33]:
(Table()
     .with_columns("Sample Medians", sample_medians,
                   "Sample Size", 400)
     .append(Table().with_columns("Sample Medians", sample_medians2,
                                  "Sample Size", 800))
     .hist("Sample Medians", group="Sample Size", bins=med_bins)
)
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');

With the larger sample size, the sample medians cluster more tightly around the population median. But in the real world we won't be able to keep going back to the population. How do we generate a new random sample without going back to the population?

Return to Slides

Bootstrap¶

Sample randomly

  • from the original sample
  • with replacement
  • the same number of times as the original sample size




Step 1: Sample the original sample, with replacement, the same number of times as the original sample size.

table.sample() # All you need!

The default behavior of tbl.sample is to draw:

  1. at random with replacement,
  2. as many times as tbl has rows.
In [34]:
bootstrap_sample = our_sample.sample()
print("Number of Rows:", bootstrap_sample.num_rows)
Number of Rows: 400
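
Because the draw is with replacement, some rows of our_sample appear several times in bootstrap_sample and others not at all; on average only about 63% of the original rows show up in any given bootstrap sample. One rough way to see the repetition (a sketch comparing counts of distinct compensation values):

# Distinct values shrink when rows are drawn more than once.
print(len(np.unique(our_sample.column('Total Compensation'))))
print(len(np.unique(bootstrap_sample.column('Total Compensation'))))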
In [35]:
bootstrap_sample.hist('Total Compensation', bins=salary_bins)

Step 2: Compute statistic on bootstrap sample.

In [36]:
percentile(50, bootstrap_sample.column('Total Compensation'))
Out[36]:
139711.0

Repeat the sampling process many times:

In [37]:
def one_bootstrap_median():
    # draw the bootstrap sample
    bootstrap_sample = our_sample.sample()
    # return the median total compensation in the bootstrap sample
    return percentile(50, bootstrap_sample.column('Total Compensation'))
In [38]:
one_bootstrap_median()
Out[38]:
135417.0
In [39]:
# Generate the medians of 1000 bootstrap samples
num_repetitions = 1000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())

Examine the empirical distribution of the bootstrap medians.

In [40]:
resampled_medians = Table().with_column('Bootstrap Sample Median', bstrap_medians)
median_bins = np.arange(120000, 160000, 2000)
resampled_medians.hist(bins = median_bins)

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');

A General Bootstrap Function¶

The following function implements the general bootstrap procedure.

In [41]:
def bootstrapper(sample, statistic, num_repetitions):
    """
    Returns the statistic computed on num_repetitions
    bootstrap samples from sample.
    """
    bstrap_stats = make_array()
    for i in np.arange(num_repetitions):
        # Step 1: Sample the Sample
        bootstrap_sample = sample.sample()
        # Step 2: Compute the statistic on the bootstrap sample
        bootstrap_stat = statistic(bootstrap_sample)
        # Accumulate the statistics
        bstrap_stats = np.append(bstrap_stats, bootstrap_stat)

    return bstrap_stats    
In [42]:
og_sample = sf.sample(400, with_replacement=False)

def compute_median(sample):
    return percentile(50, sample.column("Total Compensation"))

bootstrap_medians = bootstrapper(og_sample, compute_median, 1000)
In [43]:
(Table().with_column("bootstraps", bootstrap_medians)
        .hist(bins=median_bins))

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');
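
Nothing in bootstrapper is specific to the median: it accepts any statistic. For example, here is a sketch that bootstraps the sample mean instead (compute_mean and bootstrap_means are new names introduced here, not from the lecture):

def compute_mean(sample):
    # Mean total compensation in one bootstrap sample
    return np.mean(sample.column("Total Compensation"))

bootstrap_means = bootstrapper(og_sample, compute_mean, 1000)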

Return to Slides

Percentile Method: Middle 95% of the Bootstrap Estimates¶

Computing a confidence interval is as simple as computing percentiles of the bootstrap estimates. No magic equations!

In [44]:
left = percentile(2.5, bstrap_medians)
right = percentile(97.5, bstrap_medians)

make_array(left, right)
Out[44]:
array([ 128774.,  142258.])
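
Since we happen to know the population median in this demonstration, we can check whether the interval covers it; in practice the parameter is unknown and no such check is possible:

# Does the 95% interval cover the population parameter?
left <= pop_median <= right   # True here: 128774 <= 135747 <= 142258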
In [45]:
resampled_medians.hist(bins = median_bins)

# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);