from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action='ignore',category=np.VisibleDeprecationWarning)
In this lecture, we will:
Lists in python are like arrays but they can contain things of different types and you make them by using square brackets:
[1, 5, 'hello', 5.0]
[1, 5, 'hello', 5.0]
another_list = [1, 5, 'hello', 5.0, make_array(1,2,3)]
another_list
[1, 5, 'hello', 5.0, array([1, 2, 3])]
You have to use square brakets when accessing elements from a list (you can't use item).
another_list[2]
'hello'
Recall that for arrays you could do either. (In the real world, most people would probably use square brackets).
my_array = make_array("Apple", "Banana", "Corn")
my_array[1]
'Banana'
Same thing as:
my_array.item(1)
'Banana'
Here we make a table with a column (we have been doing this all semester).
Table().with_columns('Numbers', [1, 2, 3])
Numbers |
---|
1 |
2 |
3 |
Here we make an empty table with just column labels. (This is new!)
drinks = Table(['Drink', 'Cafe', 'Price'])
drinks
Drink | Cafe | Price |
---|
Here we add rows to our table. Each row is a List
of elements -- one for each column.
drinks = drinks.with_rows([
['Milk Tea', 'Asha', 5.5],
['Espresso', 'Strada', 1.75],
['Latte', 'Strada', 3.25],
['Espresso', "FSM", 2]
])
drinks
Drink | Cafe | Price |
---|---|---|
Milk Tea | Asha | 5.5 |
Espresso | Strada | 1.75 |
Latte | Strada | 3.25 |
Espresso | FSM | 2 |
Now let's use the group function to study you:
survey = Table.read_table("welcome_survey_sp23.csv")
survey.show(3)
Year | Extraversion | Number of textees | Hours of sleep | Handedness | Pant leg | Sleep position | Pets | Piercings |
---|---|---|---|---|---|---|---|---|
Second Year | 2 | 5 | 9 | Right-handed | Right leg in first | On your right side | Cat, Dog, Fish, Snake, Lizard | -3 |
First Year | 2 | 3 | 8 | Right-handed | I don't know | On your back | None | -1 |
First Year | 5 | 5 | 8 | Right-handed | Right leg in first | On your left side | Bearded dragon | 0 |
... (1498 rows omitted)
Exercise: What happens if we group by "Year"
with no additional arguments?
survey.group("Year")
Year | count |
---|---|
Fifth Year or above | 9 |
First Year | 606 |
Fourth Year | 109 |
Graduate Student | 7 |
Second Year | 468 |
Third Year | 302 |
What is the relationship between year and the average amount of sleep?
Exercise: Computer the average "Hours of sleep"
for each "Year"
and display the result as bar chart with the bars in a meaningful order.
(
survey
.select("Year", "Hours of sleep")
.group("Year", np.average)
.take(1,4,5,2,0,3)
.barh("Year")
)
# plots.xlim([6,7.5])
Click for Solution
(
survey
.select("Year", "Hours of sleep")
.group("Year", np.average)
.take(1,4,5,2,0,3)
.barh("Year")
)
plots.xlim([6,7.5])
survey.hist('Extraversion')
Exercise: Compute the average "Number of textees"
, "Hours of sleep"
, "Piercings"
for each value of "Extraversion"
. Save the resulting table as by_extra
.
by_extra = (
survey
.select("Extraversion", "Number of textees", "Hours of sleep", "Piercings")
.group("Extraversion", np.mean)
)
by_extra
Extraversion | Number of textees mean | Hours of sleep mean | Piercings mean |
---|---|---|---|
1 | 3.44828 | 6.60345 | 1.72414 |
2 | 4.08197 | 6.98361 | 1.13115 |
3 | 4.75263 | 7.23158 | 1.41053 |
4 | 5.14953 | 7.11449 | 1.49533 |
5 | 6.14286 | 7.07576 | 1.64502 |
6 | 6.74257 | 6.98267 | 1.87129 |
7 | 10.989 | 7.09158 | 3.66301e+06 |
8 | 9.7027 | 7.22297 | 2.14414 |
9 | 11.9608 | 7.29412 | 2.41176 |
10 | 9.25 | 6.78571 | 1.89286 |
Click for Solution
by_extra = (
survey
.select("Extraversion", "Number of textees", "Hours of sleep", "Piercings")
.group('Extraversion', np.mean)
)
by_extra
Do you see any trends? Let's plot the data to make it clearer.
by_extra.plot("Extraversion", "Number of textees mean", marker="*")
by_extra.plot("Extraversion", "Hours of sleep mean", marker="*")
Do right-handed people tend to sleep on their left side and left-handed people sleep on their right side?
Exercise: Use the group function to count how many people are in each combination of "Handedness"
and "Sleep position"
.
(
survey
.group(["Handedness", "Sleep position"])
.show()
)
Handedness | Sleep position | count |
---|---|---|
Ambidextrous | On your back | 2 |
Ambidextrous | On your left side | 4 |
Ambidextrous | On your right side | 3 |
Ambidextrous | On your stomach | 3 |
Left-handed | On your back | 27 |
Left-handed | On your left side | 40 |
Left-handed | On your right side | 46 |
Left-handed | On your stomach | 9 |
Right-handed | On your back | 334 |
Right-handed | On your left side | 395 |
Right-handed | On your right side | 475 |
Right-handed | On your stomach | 163 |
Click for Solution
(
survey
.group(["Handedness", "Sleep position"])
.show()
)
What combination of "Handedness"
and "Sleep position"
results in the most sleep on average?
Exercise: Use group to answer the above question.
(
survey
.select("Handedness", "Sleep position", "Hours of sleep")
.group(["Handedness", "Sleep position"], np.average)
.show()
)
Handedness | Sleep position | Hours of sleep average |
---|---|---|
Ambidextrous | On your back | 7 |
Ambidextrous | On your left side | 7.375 |
Ambidextrous | On your right side | 7.33333 |
Ambidextrous | On your stomach | 9 |
Left-handed | On your back | 7 |
Left-handed | On your left side | 7.2125 |
Left-handed | On your right side | 7.30435 |
Left-handed | On your stomach | 6.72222 |
Right-handed | On your back | 7.13323 |
Right-handed | On your left side | 7.05823 |
Right-handed | On your right side | 7.03579 |
Right-handed | On your stomach | 7.24847 |
Click for Solution
(
survey
.select("Handedness", "Sleep position", "Hours of sleep")
.group(['Handedness', 'Sleep position'], np.average)
.show()
)
When grouping by two columns it will sometimes be easier to re-arrange the table so the columns correspond to the values of one of the original grouped columns.
Here is an example:
(
survey
.pivot("Handedness", "Sleep position", "Hours of sleep", np.average)
)
Sleep position | Ambidextrous | Left-handed | Right-handed |
---|---|---|---|
On your back | 7 | 7 | 7.13323 |
On your left side | 7.375 | 7.2125 | 7.05823 |
On your right side | 7.33333 | 7.30435 | 7.03579 |
On your stomach | 9 | 6.72222 | 7.24847 |
Here I am rewriting the example by explicity naming the function arguments. You wouldn't do this in practice but it helps to highlight the meaning of the four things we passed to the function above.
(
survey
.pivot(
columns = "Handedness", # The column whose unique values become columns
rows = "Sleep position", # The column whose unique values become rows
values = "Hours of sleep", # The column that will be used to compute the cells
collect = np.average # How to aggregate the values
)
)
Sleep position | Ambidextrous | Left-handed | Right-handed |
---|---|---|---|
On your back | 7 | 7 | 7.13323 |
On your left side | 7.375 | 7.2125 | 7.05823 |
On your right side | 7.33333 | 7.30435 | 7.03579 |
On your stomach | 9 | 6.72222 | 7.24847 |
Just as with the regular group
function you will often want counts of combinations. In this case, you only need to specify the column to turn into columns and the column to turn into rows.
survey.pivot("Sleep position", "Handedness")
Handedness | On your back | On your left side | On your right side | On your stomach |
---|---|---|---|---|
Ambidextrous | 2 | 4 | 3 | 3 |
Left-handed | 27 | 40 | 46 | 9 |
Right-handed | 334 | 395 | 475 | 163 |
Pivots can be useful for visualizations:
survey.pivot("Sleep position", "Handedness").barh("Handedness")
We use join to join two tables on a column.
drinks
Drink | Cafe | Price |
---|---|---|
Milk Tea | Asha | 5.5 |
Espresso | Strada | 1.75 |
Latte | Strada | 3.25 |
Espresso | FSM | 2 |
discounts = Table().with_columns(
'Coupon % off', make_array(10, 25, 5),
'Location', make_array('Asha', 'Strada', 'Asha')
)
discounts
Coupon % off | Location |
---|---|
10 | Asha |
25 | Strada |
5 | Asha |
Exercise: Combine the drinks
and discounts
table into a new table which has all possible discounts for each drink.
combined = drinks.join('Cafe', discounts, 'Location')
combined
Cafe | Drink | Price | Coupon % off |
---|---|---|---|
Asha | Milk Tea | 5.5 | 10 |
Asha | Milk Tea | 5.5 | 5 |
Strada | Espresso | 1.75 | 25 |
Strada | Latte | 3.25 | 25 |
Click for Solution
combined = drinks.join('Cafe', discounts, 'Location')
combined
Exercise: Calculate the lowest rate you can get for each drink using the available coupons.
discount_frac = 1-combined.column("Coupon % off")/100.0
(
combined
.with_column("Discounted Price", combined.column("Price") * discount_frac)
.select("Drink", "Discounted Price")
.group("Drink", min)
)
Drink | Discounted Price min |
---|---|
Espresso | 1.3125 |
Latte | 2.4375 |
Milk Tea | 4.95 |
Click for Solution
combined = drinks.join('Cafe', discounts, 'Location') # from previous question
discount_frac = 1-combined.column("Coupon % off")/100.0
(
combined
.with_column("Discounted Price", combined.column("Price") * discount_frac)
.select("Drink", "Discounted Price")
.group("Drink", min)
)
Exercise: What happens if I run the following? How many rows will it produce.
drinks.join('Cafe', drinks, 'Cafe')
Cafe | Drink | Price | Drink_2 | Price_2 |
---|---|---|---|---|
Asha | Milk Tea | 5.5 | Milk Tea | 5.5 |
FSM | Espresso | 2 | Espresso | 2 |
Strada | Espresso | 1.75 | Espresso | 1.75 |
Strada | Espresso | 1.75 | Latte | 3.25 |
Strada | Latte | 3.25 | Espresso | 1.75 |
Strada | Latte | 3.25 | Latte | 3.25 |
For the following practice questions we will use the sky scraper dataset
# From the CORGIS Dataset Project
# By Austin Cory Bart acbart@vt.edu
# Version 2.0.0, created 3/22/2016
# https://corgis-edu.github.io/corgis/csv/skyscrapers/
this_year = 2023
sky = Table.read_table("skyscrapers.csv")
sky = (
sky
.with_column("age", this_year - sky.column("completed"))
.drop("completed")
)
sky
name | material | city | height | age |
---|---|---|---|---|
One World Trade Center | mixed/composite | New York City | 541.3 | 9 |
Willis Tower | steel | Chicago | 442.14 | 49 |
432 Park Avenue | concrete | New York City | 425.5 | 8 |
Trump International Hotel & Tower | concrete | Chicago | 423.22 | 14 |
Empire State Building | steel | New York City | 381 | 92 |
Bank of America Tower | mixed/composite | New York City | 365.8 | 14 |
Stratosphere Tower | concrete | Las Vegas | 350.22 | 27 |
Aon Center | steel | Chicago | 346.26 | 50 |
John Hancock Center | steel | Chicago | 343.69 | 54 |
Chrysler Building | steel | New York City | 318.9 | 93 |
... (1771 rows omitted)
Exercise: For each city, what’s the tallest building for each material?
tall_pivot = sky.pivot("material", "city", "height", max)
tall_pivot
city | concrete | mixed/composite | steel |
---|---|---|---|
Atlanta | 264.25 | 311.8 | 169.47 |
Austin | 208.15 | 0 | 93.6 |
Baltimore | 161.24 | 0 | 155.15 |
Boston | 121.92 | 139 | 240.79 |
Charlotte | 265.48 | 239.7 | 179.23 |
Chicago | 423.22 | 306.94 | 442.14 |
Cincinnati | 125 | 202.69 | 175 |
Cleveland | 125 | 288.65 | 215.8 |
Columbus | 79.25 | 0 | 169.3 |
Dallas | 176.48 | 280.72 | 270.06 |
... (25 rows omitted)
Click for Solution
tall_pivot = sky.pivot("material", "city", "height", max)
tall_pivot
tall_pivot.barh("city")
Exercise: For each city that has at least 50 buildings in the dataset, compute the average height of buildings broken down by material. Make a bar chart with the cities and the average height of buildings for each material.
num_buildings = sky.group("city")
(
sky
.join("city", num_buildings)
.where("count", are.above_or_equal_to(50))
.pivot("material", "city", "height", np.mean)
.barh("city")
)
Click for Solution
num_buildings = sky.group("city")
(
sky
.join("city", num_buildings)
.where("count", are.above_or_equal_to(50))
.pivot("material", "city", "height", np.mean)
.barh("city")
)
Exercise: Generate a table of the names and ages of the oldest buildings for each combination of city and materials.
def first(x):
return x[0]
(
sky
.sort("age", descending=True)
.select("city", "material", "name", "age")
.group(["city", "material"], first)
#.where("city", "San Francisco") #<- for fun
)
city | material | name first | age first |
---|---|---|---|
Atlanta | concrete | Westin Peachtree Plaza | 47 |
Atlanta | mixed/composite | One Atlantic Center | 36 |
Atlanta | steel | FlatironCity | 126 |
Austin | concrete | One American Center | 39 |
Austin | steel | University of Texas Tower | 86 |
Baltimore | concrete | Charles Towers North Apartments | 56 |
Baltimore | steel | Emerson Tower | 112 |
Boston | concrete | Harbor Towers I | 52 |
Boston | mixed/composite | Ellison Building | 31 |
Boston | steel | Marriott's Custom House | 108 |
... (81 rows omitted)