from datascience import *
import numpy as np
Exercise: Make an array of 4 elements:
my_array = make_array(1, 2, 3, 4)
my_array
array([1, 2, 3, 4])
Exercise: Arrays can hold any type. Make an array of strings called string_array:
string_array = make_array("cat", "dog", "bird")
string_array
array(['cat', 'dog', 'bird'], dtype='<U4')
Exercise: Mixing types (Strings, Numbers, Booleans). Make an array of multiple types:
weird_array = make_array("cat", 3, True)
weird_array
array(['cat', '3', 'True'], dtype='<U21')
What is the type of weird_array?
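One way to explore this question is to inspect the array directly. The sketch below assumes that make_array behaves like NumPy's np.array for these inputs; np.array is used as a stand-in so the snippet runs without the datascience package. It suggests that NumPy converts every element to a string, since an array holds a single element type.

```python
import numpy as np

# Stand-in for make_array("cat", 3, True); assumes make_array builds a NumPy array.
weird_array = np.array(["cat", 3, True])

# The container itself is a NumPy array ...
print(type(weird_array))
# ... and every element has been converted to a string.
print(weird_array.dtype.kind)      # 'U' means Unicode string
print(type(weird_array.item(1)))   # the number 3 is now the string '3'
```

So the number and the Boolean do not survive as such; if you need to do arithmetic later, keep numbers in their own array.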
We use ranges to easily make arrays of number sequences. The NumPy function np.arange(start, stop, step) produces an array starting at start and ending before stop, in increments of step.
Exercise: Make an array of the numbers 0 through 6:
make_array(0, 1, 2, 3, 4, 5, 6)
array([0, 1, 2, 3, 4, 5, 6])
np.arange(0, 7, 1)
array([0, 1, 2, 3, 4, 5, 6])
np.arange(0, 7)
array([0, 1, 2, 3, 4, 5, 6])
np.arange(7)
array([0, 1, 2, 3, 4, 5, 6])
Exercise: What will the following produce:
np.arange(40, -1, -5)
array([40, 35, 30, 25, 20, 15, 10, 5, 0])
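The stop value of -1 in this exercise can look odd at first. A minimal sketch of why it is there: because np.arange always ends before stop, counting down to 0 inclusive needs a stop value past 0. The variable names here are only for illustration.

```python
import numpy as np

# stop is excluded, so stopping at -1 lets the sequence reach 0 ...
with_zero = np.arange(40, -1, -5)
# ... while stopping at 0 ends the sequence one step earlier, at 5.
without_zero = np.arange(40, 0, -5)

print(with_zero)
print(without_zero)
print(len(with_zero), len(without_zero))  # 9 and 8 elements
```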
For this exercise, let's start with this array of strings.
string_array = make_array("cat", "dog", "bird")
string_array
array(['cat', 'dog', 'bird'], dtype='<U4')
You can use array_name.item(NUMBER) to get the element at position NUMBER in an array. Positions are counted starting from 0.
Exercise: What will the following expression return?
string_array.item(1)
'dog'
Bonus! This is called array indexing. There is a shorter "equivalent" syntax that people often use. However, for this class you only need to know about .item(), but you may use whichever you prefer.
string_array[1]
'dog'
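The quotation marks around "equivalent" hint at a small difference between the two forms. As a rough sketch (using np.array in place of make_array so it runs without the datascience package): both return the same value, but .item() hands back a plain Python object, while square brackets hand back a NumPy scalar.

```python
import numpy as np

string_array = np.array(["cat", "dog", "bird"])

from_item = string_array.item(1)   # plain Python str
from_index = string_array[1]       # NumPy string scalar

print(from_item == from_index)     # the values compare equal
print(type(from_item))
print(type(from_index))
```

For the exercises in this class the difference rarely matters, which is why either form is acceptable.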
Exercise: Use the len function to determine the length of the string array.
len(string_array)
3
Arrays also have a member variable array_name.size that contains the size of the array.
Exercise: Use the size member variable to check the size of the array:
string_array.size
3
You will often need to compute summaries of an array, such as the sum, max, or min. These are all member functions of an array. The NumPy documentation lists all the member functions of arrays.
cool_numbers = make_array(0, 1, 42, np.pi, np.e)
cool_numbers
array([ 0. , 1. , 42. , 3.14159265, 2.71828183])
Exercise: Use the sum, min, mean, and max operations to summarize the cool numbers array.
print("sum", cool_numbers.sum())
print("min", cool_numbers.min())
print("mean", cool_numbers.mean())
print("max", cool_numbers.max())
sum 48.859874482
min 0.0
mean 9.77197489641
max 42.0
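These summaries are related to each other. As a small sanity check, the sketch below recomputes the mean by hand as the sum divided by the number of elements; np.array stands in for make_array, and manual_mean is a name introduced only for this illustration.

```python
import numpy as np

cool_numbers = np.array([0, 1, 42, np.pi, np.e])

# The mean is the sum divided by the number of elements.
manual_mean = cool_numbers.sum() / cool_numbers.size
print(manual_mean)
print(np.isclose(manual_mean, cool_numbers.mean()))
```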
You can also use NumPy's built-in library of math functions on arrays. Here we compute the mean and the log:
print("np.average", np.average(my_array))
print("np.mean", np.mean(my_array))
print("np.log", np.log(my_array))
np.average 2.5
np.mean 2.5
np.log [ 0. 0.69314718 1.09861229 1.38629436]
You can do mathematical operations on arrays:
a = make_array(1, 2, 3, 4)
b = make_array(10, 20, 30, 40)
print("The a array:", a)
print("The b array:", b)
The a array: [1 2 3 4]
The b array: [10 20 30 40]
Exercise: Add and multiply the arrays:
a + b
array([11, 22, 33, 44])
a * b
array([ 10, 40, 90, 160])
print("Adding Arrays", a + b)
print("Multiplying Arrays", a * b)
You can also add and multiply scalars:
a * 3.
array([ 3., 6., 9., 12.])
3 + b
array([13, 23, 33, 43])
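Notice that one result above prints with decimal points and the other does not. A brief sketch of what seems to be going on, with np.array standing in for make_array: the scalar is applied to every element, and a float scalar such as 3. turns an integer array into a float array, while an integer scalar leaves it as integers.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

scaled = a * 3.    # float scalar -> float array
shifted = 3 + b    # int scalar -> int array

# 'f' is the dtype kind for floats, 'i' for integers.
print(scaled.dtype.kind, shifted.dtype.kind)
```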
Exercise: What happens if we run the following:
bigger_array = make_array(1,2,3,4,5)
a * bigger_array
# bigger_array = make_array(1,2,3,4,5)
# a * bigger_array
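To see the outcome without stopping the notebook, the sketch below wraps the multiplication in try/except. It assumes np.array as a stand-in for make_array; outcome is a hypothetical variable used only to record what happened. Arrays of 4 and 5 elements cannot be paired up element by element, so NumPy raises a ValueError.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
bigger_array = np.array([1, 2, 3, 4, 5])

try:
    a * bigger_array
    outcome = "no error"
except ValueError as err:
    # NumPy cannot pair up 4 elements with 5 elements.
    outcome = "ValueError"
    print("ValueError:", err)
```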
Exercise: What happens if I run the following:
uhoh = make_array(0,1,2,3)
a / uhoh
# uhoh = make_array(0,1,2,3)
# a / uhoh
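Unlike plain Python, NumPy does not raise an exception for this division. In the sketch below (np.array again stands in for make_array), the first element becomes inf and NumPy would normally print a RuntimeWarning; np.errstate is used here only to keep the output quiet.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
uhoh = np.array([0, 1, 2, 3])

# Silence the RuntimeWarning that NumPy normally emits for 1 / 0.
with np.errstate(divide="ignore"):
    result = a / uhoh

print(result)
print(np.isinf(result))   # True only where we divided by zero
```

Because the computation keeps going, an inf can silently flow into later summaries such as the mean, so it is worth checking for.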
Exercise: What happens if I run the following:
a.item(4)
# a.item(4)
Exercise: What happens if I run the following:
a.item(-1)
4
Negative indexing is a common trick to access the end of an array.
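The last two exercises can be checked together. In this sketch (np.array stands in for make_array, and outcome is a hypothetical variable that records the result), an array of 4 elements accepts positions 0 through 3 and -1 through -4; position 4 is out of bounds and raises an IndexError.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Valid positions are 0..3 from the front and -1..-4 from the back.
print(a.item(-1), a.item(-4))

try:
    a.item(4)
    outcome = "no error"
except IndexError as err:
    outcome = "IndexError"
    print("IndexError:", err)
```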
We are covering arrays because this is the mathematical object that is returned when we work on specific columns of a table. Here we load a table of NBA salaries from a local file, nba_salaries.csv.
nba = Table.read_table('nba_salaries.csv')
nba
rank | name | position | team | salary | season |
---|---|---|---|---|---|
1 | Shaquille O'Neal | C | Los Angeles Lakers | 17142000 | 2000 |
2 | Kevin Garnett | PF | Minnesota Timberwolves | 16806000 | 2000 |
3 | Alonzo Mourning | C | Miami Heat | 15004000 | 2000 |
4 | Juwan Howard | PF | Washington Wizards | 15000000 | 2000 |
5 | Scottie Pippen | SF | Portland Trail Blazers | 14795000 | 2000 |
6 | Karl Malone | PF | Utah Jazz | 14000000 | 2000 |
7 | Larry Johnson | F | New York Knicks | 11910000 | 2000 |
8 | Gary Payton | PG | Seattle SuperSonics | 11020000 | 2000 |
9 | Rasheed Wallace | PF | Portland Trail Blazers | 10800000 | 2000 |
10 | Shawn Kemp | C | Cleveland Cavaliers | 10780000 | 2000 |
... (9446 rows omitted)
Let's focus on the Golden State Warriors.
Exercise: Use the my_table.where function to select the rows where team is the "Golden State Warriors".
warriors = nba.where("team", "Golden State Warriors")
warriors
rank | name | position | team | salary | season |
---|---|---|---|---|---|
41 | Donyell Marshall | PF | Golden State Warriors | 5250000 | 2000 |
47 | Erick Dampier | C | Golden State Warriors | 4988000 | 2000 |
58 | Mookie Blaylock | G | Golden State Warriors | 4200000 | 2000 |
59 | Chris Mills | SF | Golden State Warriors | 4200000 | 2000 |
64 | Jason Caffey | F | Golden State Warriors | 3937000 | 2000 |
89 | Vonteego Cummings | PG | Golden State Warriors | 2600000 | 2000 |
92 | Antawn Jamison | PF | Golden State Warriors | 2503000 | 2000 |
73 | Erick Dampier | C | Golden State Warriors | 5611000 | 2001 |
91 | Mookie Blaylock | G | Golden State Warriors | 4800000 | 2001 |
92 | Chris Mills | SF | Golden State Warriors | 4800000 | 2001 |
... (301 rows omitted)
We can also select columns by name.
Exercise: Make a table with just the "name" and "salary" of the Warriors.
warriors.select("name", "salary")
name | salary |
---|---|
Donyell Marshall | 5250000 |
Erick Dampier | 4988000 |
Mookie Blaylock | 4200000 |
Chris Mills | 4200000 |
Jason Caffey | 3937000 |
Vonteego Cummings | 2600000 |
Antawn Jamison | 2503000 |
Erick Dampier | 5611000 |
Mookie Blaylock | 4800000 |
Chris Mills | 4800000 |
... (301 rows omitted)
Exercise: Compute the average salary of the Warriors. Which of the following works?
Option (A):
warriors.mean()
Option (B):
warriors.select("salary").mean()
Option (C):
warriors.column("salary").mean()
warriors.column("salary").mean()
4315935.9228295824
Exercise: Would the following work?
np.average(warriors.select("salary"))
# np.average(warriors.select("salary"))
type(warriors.select("salary"))
datascience.tables.Table
type(warriors.column("salary"))
numpy.ndarray
Exercise: Use np.average to compute the average salary of the Warriors:
np.average(warriors.column("salary"))
4315935.9228295824
Exercise: Compute the difference in the average salaries of the Warriors and the "Los Angeles Lakers".
lakers = nba.where('team', 'Los Angeles Lakers')
warriors.column('salary').mean() - lakers.column('salary').mean()
-839856.02846911922
Let's start with an array of street names.
streets = make_array('Bancroft', 'Durant', 'Channing', 'Haste')
streets
array(['Bancroft', 'Durant', 'Channing', 'Haste'], dtype='<U8')
We can make an empty table (no rows, no columns, no problems ...).
The Table() function makes an empty table.
empty_table = Table()
empty_table
Exercise: Check that the empty table has 0 rows and 0 columns:
print("Rows:", empty_table.num_rows)
print("Cols:", empty_table.num_columns)
Rows: 0
Cols: 0
Exercise: Use the table.with_column function to add a column to the table and save the new table as southside.
southside = empty_table.with_column("Streets", streets)
southside
Streets |
---|
Bancroft |
Durant |
Channing |
Haste |
Exercise: Can you do the same thing without using empty_table?
southside = Table().with_column("Streets", streets)
southside
Streets |
---|
Bancroft |
Durant |
Channing |
Haste |
Exercise: What is the output of:
empty_table.with_column("Streets", streets)
print("Number of Columns", empty_table.num_columns)
Number of Columns 0
Exercise: Extend the southside table to include the blocks from campus. (map)
southside = southside.with_column('Blocks from campus', np.arange(4))
southside
Streets | Blocks from campus |
---|---|
Bancroft | 0 |
Durant | 1 |
Channing | 2 |
Haste | 3 |
Exercise: Build the entire table with blocks from campus in one call to the table.with_columns function.
Table().with_columns(
'Streets', streets,
'Blocks from campus', np.arange(4)
)
Streets | Blocks from campus |
---|---|
Bancroft | 0 |
Durant | 1 |
Channing | 2 |
Haste | 3 |
From Wikipedia: William Edward Burghardt Du Bois (/djuːˈbɔɪs/ dew-BOYSS; February 23, 1868 – August 27, 1963) was an American sociologist, socialist, historian, and Pan-Africanist civil rights activist. Born in Great Barrington, Massachusetts, Du Bois grew up in a relatively tolerant and integrated community. After completing graduate work at the University of Berlin and Harvard University, where he was the first African American to earn a doctorate, he became a professor of history, sociology, and economics at Atlanta University. Du Bois was one of the founders of the National Association for the Advancement of Colored People (NAACP) in 1909.
For more context on the visualization in lecture, check out "Du Bois’ Data Portraits Tell A Story About Black Life In Georgia And Beyond".
du_bois = Table.read_table('du_bois.csv')
du_bois
CLASS | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS |
---|---|---|---|---|---|---|---|
100-200 | 139.1 | 0.19 | 0.43 | 0.28 | 0.001 | 0.099 | POOR |
200-300 | 249.45 | 0.22 | 0.47 | 0.23 | 0.04 | 0.04 | POOR |
300-400 | 335.66 | 0.23 | 0.43 | 0.18 | 0.045 | 0.115 | FAIR |
400-500 | 433.82 | 0.18 | 0.37 | 0.15 | 0.055 | 0.245 | FAIR |
500-750 | 547 | 0.13 | 0.31 | 0.17 | 0.05 | 0.34 | COMFORTABLE |
750-1000 | 880 | 0 | 0.37 | 0.19 | 0.08 | 0.36 | COMFORTABLE |
1000 and over | 1125 | 0 | 0.29 | 0.16 | 0.045 | 0.505 | WELL-TO-DO |
Exercise: Compute the amount of money spent on food and add it to the table as "FOOD $":
du_bois = du_bois.with_columns(
"FOOD $", du_bois.column('ACTUAL AVERAGE') * du_bois.column('FOOD'))
du_bois
CLASS | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS | FOOD $ |
---|---|---|---|---|---|---|---|---|
100-200 | 139.1 | 0.19 | 0.43 | 0.28 | 0.001 | 0.099 | POOR | 59.813 |
200-300 | 249.45 | 0.22 | 0.47 | 0.23 | 0.04 | 0.04 | POOR | 117.241 |
300-400 | 335.66 | 0.23 | 0.43 | 0.18 | 0.045 | 0.115 | FAIR | 144.334 |
400-500 | 433.82 | 0.18 | 0.37 | 0.15 | 0.055 | 0.245 | FAIR | 160.513 |
500-750 | 547 | 0.13 | 0.31 | 0.17 | 0.05 | 0.34 | COMFORTABLE | 169.57 |
750-1000 | 880 | 0 | 0.37 | 0.19 | 0.08 | 0.36 | COMFORTABLE | 325.6 |
1000 and over | 1125 | 0 | 0.29 | 0.16 | 0.045 | 0.505 | WELL-TO-DO | 326.25 |
Exercise: Use the table functions we learned this week to find the income bracket ("class") that spent the most money on rent.
du_bois = du_bois.with_columns("RENT $",
du_bois.column("RENT") * du_bois.column("ACTUAL AVERAGE"))
du_bois.sort("RENT $", descending = True)
CLASS | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS | FOOD $ | RENT $ |
---|---|---|---|---|---|---|---|---|---|
400-500 | 433.82 | 0.18 | 0.37 | 0.15 | 0.055 | 0.245 | FAIR | 160.513 | 78.0876 |
300-400 | 335.66 | 0.23 | 0.43 | 0.18 | 0.045 | 0.115 | FAIR | 144.334 | 77.2018 |
500-750 | 547 | 0.13 | 0.31 | 0.17 | 0.05 | 0.34 | COMFORTABLE | 169.57 | 71.11 |
200-300 | 249.45 | 0.22 | 0.47 | 0.23 | 0.04 | 0.04 | POOR | 117.241 | 54.879 |
100-200 | 139.1 | 0.19 | 0.43 | 0.28 | 0.001 | 0.099 | POOR | 59.813 | 26.429 |
750-1000 | 880 | 0 | 0.37 | 0.19 | 0.08 | 0.36 | COMFORTABLE | 325.6 | 0 |
1000 and over | 1125 | 0 | 0.29 | 0.16 | 0.045 | 0.505 | WELL-TO-DO | 326.25 | 0 |
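The sorted table puts the answer in its first row. As a cross-check that uses only arrays, the sketch below copies the CLASS, ACTUAL AVERAGE, and RENT values from the table into NumPy arrays and uses np.argmax to locate the largest product. np.argmax is not one of the table functions from this week, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Values copied from the CLASS, ACTUAL AVERAGE, and RENT columns above.
classes = np.array(["100-200", "200-300", "300-400", "400-500",
                    "500-750", "750-1000", "1000 and over"])
actual_average = np.array([139.1, 249.45, 335.66, 433.82, 547, 880, 1125])
rent_fraction = np.array([0.19, 0.22, 0.23, 0.18, 0.13, 0, 0])

# Element-wise product: dollars spent on rent in each income class.
rent_dollars = actual_average * rent_fraction
top = np.argmax(rent_dollars)          # position of the largest value
print(classes.item(top), rent_dollars.item(top))
```

This agrees with the sorted table: the 400-500 class spent the most on rent, narrowly ahead of the 300-400 class.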