Lecture 5¶

In [1]:
from datascience import *
import numpy as np

Arrays¶

Arrays are ordered "lists" of elements that can be directly accessed by location.

Making Arrays¶

Exercise: Make an array of 4 elements:

In [2]:
my_array = make_array(1, 2, 3, 4)
my_array
Out[2]:
array([1, 2, 3, 4])


Exercise: Arrays can be any type. Make an array of Strings called string_array:

In [3]:
string_array = make_array("cat", "dog", "bird")
string_array
Out[3]:
array(['cat', 'dog', 'bird'],
      dtype='<U4')


Exercise: Mixing types (Strings, Numbers, Booleans). Make an array of multiple types:

In [4]:
weird_array = make_array("cat", 3, True)
weird_array
Out[4]:
array(['cat', '3', 'True'],
      dtype='<U21')


What is the type of weird_array?
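
One way to answer this is to ask the array itself. The sketch below uses np.array in place of make_array so that it runs with NumPy alone (an assumption made for self-containment; it should behave the same way for these inputs). It suggests that the array holds a single element type, so the number and the Boolean are converted to strings.

```python
import numpy as np

# np.array stands in for make_array here (assumption, see the note above).
weird_array = np.array(["cat", 3, True])

print(type(weird_array))     # the container itself is a NumPy array
print(weird_array.dtype)     # one dtype shared by every element
print(weird_array.tolist())  # the elements as plain Python values
```

The dtype in Out[4] above, '<U21', is a Unicode string type, which matches the quoted '3' and 'True' in the output.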



Ranges¶

We use ranges to easily make arrays of number sequences. The NumPy function np.arange(start, stop, step) produces an array starting at start and ending before stop, in increments of step.

Exercise: Make an array of the numbers 0 through 6:

In [5]:
make_array(0, 1, 2, 3, 4, 5, 6)
Out[5]:
array([0, 1, 2, 3, 4, 5, 6])
In [6]:
np.arange(0, 7, 1)
Out[6]:
array([0, 1, 2, 3, 4, 5, 6])
In [7]:
np.arange(0, 7)
Out[7]:
array([0, 1, 2, 3, 4, 5, 6])
In [8]:
np.arange(7)
Out[8]:
array([0, 1, 2, 3, 4, 5, 6])

Exercise: What will the following produce:

In [9]:
np.arange(40, -1, -5) 
Out[9]:
array([40, 35, 30, 25, 20, 15, 10,  5,  0])
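
The "ending before stop" rule is easy to forget, so here is a small sketch with a few more ranges. The variable names are only for illustration.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # starts at 0, stops before 10
countdown = np.arange(5, 0, -1)  # a negative step counts down, stopping before 0
empty = np.arange(5, 0)          # the default step is +1, so nothing lies between 5 and 0

print(evens)
print(countdown)
print(empty.size)
```

In each case stop itself is never included, which is why Out[9] above needed a stop of -1 to reach 0.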




Accessing Elements¶

For this exercise, let's start with this array of strings.

In [10]:
string_array = make_array("cat", "dog", "bird")
string_array
Out[10]:
array(['cat', 'dog', 'bird'],
      dtype='<U4')

You can use array_name.item( NUMBER ) to get an element from an array.

Exercise: What will the following expression return?

In [11]:
string_array.item(1)
Out[11]:
'dog'

Bonus! This is called array indexing. There is a shorter "equivalent" syntax that people often use. For this class you only need to know .item(), but you may use whichever you prefer.

In [12]:
string_array[1]
Out[12]:
'dog'
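
The quotes around "equivalent" hint at a small difference. The sketch below, which uses np.array so that it runs with NumPy alone, compares the two forms on an array of integers: both give the same value, but .item() hands back a plain Python value while square brackets hand back a NumPy scalar.

```python
import numpy as np

numbers = np.array([10, 20, 30])

via_item = numbers.item(1)   # plain Python int
via_brackets = numbers[1]    # NumPy integer scalar

print(via_item == via_brackets)
print(type(via_item))
print(type(via_brackets))
```

For the arithmetic in this class the distinction rarely matters.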

Exercise: Use the len function to determine the length of the string array.

In [13]:
len(string_array)
Out[13]:
3


Arrays also have a member variable array_name.size that contains the size of the array.

Exercise: Use the size member variable to check the size of the array:

In [14]:
string_array.size
Out[14]:
3





Aggregation Operations¶

You will often need to compute summaries of an array like the sum, max, or the min. These are all member functions of an array. Here is the documentation on all the member functions for arrays.

In [15]:
cool_numbers = make_array(0, 1, 42, np.pi, np.e)
cool_numbers
Out[15]:
array([  0.        ,   1.        ,  42.        ,   3.14159265,   2.71828183])

Exercise: Use the sum, min, mean, and max operations to summarize the cool numbers array.

In [16]:
print("sum", cool_numbers.sum())
print("min", cool_numbers.min())
print("mean", cool_numbers.mean())
print("max", cool_numbers.max())
sum 48.859874482
min 0.0
mean 9.77197489641
max 42.0


You can also use NumPy's built-in library of math functions on arrays. Here we compute the mean (two ways) and the natural log:

In [17]:
print("np.average", np.average(my_array))
print("np.mean", np.mean(my_array))
print("np.log", np.log(my_array))
np.average 2.5
np.mean 2.5
np.log [ 0.          0.69314718  1.09861229  1.38629436]
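
Notice that the functions above come in two flavors. A minimal sketch of the difference, using np.array so that it runs with NumPy alone: an aggregation such as the mean collapses the array to one number, while a function such as np.log is applied to each element and returns an array of the same length.

```python
import numpy as np

values = np.array([1, 2, 3, 4])

# Aggregation: many numbers in, one number out.
mean_by_hand = values.sum() / values.size
print(mean_by_hand, np.mean(values))

# Elementwise function: one output per input element.
logs = np.log(values)
print(logs.shape == values.shape)
```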




Doing math with arrays¶

You can do mathematical operations on arrays:

In [18]:
a = make_array(1, 2, 3, 4)
b = make_array(10, 20, 30, 40)
print("The a array:", a)
print("The b array:", b)
The a array: [1 2 3 4]
The b array: [10 20 30 40]

Exercise: Add and multiply the arrays:

In [19]:
a + b
Out[19]:
array([11, 22, 33, 44])
In [20]:
a * b
Out[20]:
array([ 10,  40,  90, 160])
Solution
print("Adding Arrays", a + b)
print("Multiplying Arrays", a * b)


You can also add and multiply by scalars:

In [21]:
a * 3.
Out[21]:
array([  3.,   6.,   9.,  12.])
In [22]:
3 + b
Out[22]:
array([13, 23, 33, 43])
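
Array arithmetic works element by element. The sketch below spells out what a + b computes by writing the same sum as an explicit loop over positions. It uses np.array so that it runs with NumPy alone, and the loop is only there for comparison; the array expression is the form you would normally write.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# The same sum, one position at a time.
sum_by_loop = [a.item(i) + b.item(i) for i in range(len(a))]

print(sum_by_loop)
print((a + b).tolist())

# A scalar is combined with every element in the same way.
print((3 + b).tolist())
```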




Common Bugs¶

Exercise: What happens if we run the following:

bigger_array = make_array(1,2,3,4,5)
a * bigger_array
In [23]:
# bigger_array = make_array(1,2,3,4,5)
# a * bigger_array

Exercise: What happens if I run the following:

uhoh = make_array(0,1,2,3)
a / uhoh
In [24]:
# uhoh = make_array(0,1,2,3)
# a / uhoh

Exercise: What happens if I run the following:

a.item(4)
In [25]:
# a.item(4)

Exercise: What happens if I run the following:

a.item(-1)
In [26]:
a.item(-1)
Out[26]:
4

Negative indexing is a common trick to access elements from the end of an array: index -1 is the last element.
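
If you would rather see the outcomes of the commented-out cells without crashing your notebook, the sketch below catches each error instead. It uses np.array so that it runs with NumPy alone, and the try/except and np.errstate wrappers are additions for illustration, not something the exercises require.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Arrays of different lengths cannot be multiplied element by element.
try:
    a * np.array([1, 2, 3, 4, 5])
except ValueError as err:
    shape_error = str(err)
    print("ValueError:", shape_error)

# Dividing by zero does not raise an error. NumPy issues a warning
# (silenced here) and puts inf in that position.
with np.errstate(divide="ignore"):
    quotient = a / np.array([0, 1, 2, 3])
print(quotient)

# Positions run from 0 to 3, so position 4 does not exist.
try:
    a.item(4)
except IndexError as err:
    index_error = str(err)
    print("IndexError:", index_error)

# Position -1 is the same element as the last position.
print(a.item(-1) == a.item(len(a) - 1))
```

The division case is the sneaky one, because the code keeps running with inf in the result.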






Tables are Made of Arrays¶

We are covering arrays because this is the mathematical object that is returned when we work on specific columns of a table. Here we load a table of NBA salaries from a local file nba_salaries.csv.

In [27]:
nba = Table.read_table('nba_salaries.csv')
nba
Out[27]:
rank | name             | position | team                   | salary   | season
1    | Shaquille O'Neal | C        | Los Angeles Lakers     | 17142000 | 2000
2    | Kevin Garnett    | PF       | Minnesota Timberwolves | 16806000 | 2000
3    | Alonzo Mourning  | C        | Miami Heat             | 15004000 | 2000
4    | Juwan Howard     | PF       | Washington Wizards     | 15000000 | 2000
5    | Scottie Pippen   | SF       | Portland Trail Blazers | 14795000 | 2000
6    | Karl Malone      | PF       | Utah Jazz              | 14000000 | 2000
7    | Larry Johnson    | F        | New York Knicks        | 11910000 | 2000
8    | Gary Payton      | PG       | Seattle SuperSonics    | 11020000 | 2000
9    | Rasheed Wallace  | PF       | Portland Trail Blazers | 10800000 | 2000
10   | Shawn Kemp       | C        | Cleveland Cavaliers    | 10780000 | 2000

... (9446 rows omitted)

Let's focus on the Golden State Warriors.

Exercise: Use the my_table.where function to select the rows where team is the "Golden State Warriors".

In [28]:
warriors = nba.where("team", "Golden State Warriors")
warriors
Out[28]:
rank | name              | position | team                  | salary  | season
41   | Donyell Marshall  | PF       | Golden State Warriors | 5250000 | 2000
47   | Erick Dampier     | C        | Golden State Warriors | 4988000 | 2000
58   | Mookie Blaylock   | G        | Golden State Warriors | 4200000 | 2000
59   | Chris Mills       | SF       | Golden State Warriors | 4200000 | 2000
64   | Jason Caffey      | F        | Golden State Warriors | 3937000 | 2000
89   | Vonteego Cummings | PG       | Golden State Warriors | 2600000 | 2000
92   | Antawn Jamison    | PF       | Golden State Warriors | 2503000 | 2000
73   | Erick Dampier     | C        | Golden State Warriors | 5611000 | 2001
91   | Mookie Blaylock   | G        | Golden State Warriors | 4800000 | 2001
92   | Chris Mills       | SF       | Golden State Warriors | 4800000 | 2001

... (301 rows omitted)



We can also select columns by name.

Exercise: Make a table with just the "name" and "salary" of the warriors.

In [29]:
warriors.select("name", "salary")
Out[29]:
name              | salary
Donyell Marshall  | 5250000
Erick Dampier     | 4988000
Mookie Blaylock   | 4200000
Chris Mills       | 4200000
Jason Caffey      | 3937000
Vonteego Cummings | 2600000
Antawn Jamison    | 2503000
Erick Dampier     | 5611000
Mookie Blaylock   | 4800000
Chris Mills       | 4800000

... (301 rows omitted)



Exercise: Compute the average salary of the Warriors. Which of the following works?

Option (A):

warriors.mean()

Option (B):

warriors.select("salary").mean()

Option (C):

warriors.column("salary").mean()
In [30]:
warriors.column("salary").mean()
Out[30]:
4315935.9228295824

Exercise: Would the following work?

np.average(warriors.select("salary"))
In [31]:
# np.average(warriors.select("salary"))
In [32]:
type(warriors.select("salary"))
Out[32]:
datascience.tables.Table
In [33]:
type(warriors.column("salary"))
Out[33]:
numpy.ndarray

Exercise: Use np.average to compute the average salary of the Warriors:

In [34]:
np.average(warriors.column("salary"))
Out[34]:
4315935.9228295824


Exercise: Compute the difference in the average salaries of the warriors and the "Los Angeles Lakers".

In [35]:
lakers = nba.where('team', 'Los Angeles Lakers')
warriors.column('salary').mean() - lakers.column('salary').mean()
Out[35]:
-839856.02846911922
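
To see why working with columns means working with arrays, here is a NumPy-only stand-in for the where-then-column pattern. It copies the team and salary columns from the ten rows displayed in Out[27], so the numbers cover only those rows and not the full file. The comparison team == "..." builds an array of True/False values that is then used to pick out salaries. That selection style goes beyond what this lecture covers and is only an illustration, not how the Table methods are implemented.

```python
import numpy as np

# The team and salary columns of the ten rows shown in Out[27].
team = np.array([
    "Los Angeles Lakers", "Minnesota Timberwolves", "Miami Heat",
    "Washington Wizards", "Portland Trail Blazers", "Utah Jazz",
    "New York Knicks", "Seattle SuperSonics", "Portland Trail Blazers",
    "Cleveland Cavaliers",
])
salary = np.array([
    17142000, 16806000, 15004000, 15000000, 14795000,
    14000000, 11910000, 11020000, 10800000, 10780000,
])

# Roughly what where(...) followed by column("salary") gives us.
blazers_salary = salary[team == "Portland Trail Blazers"]
lakers_salary = salary[team == "Los Angeles Lakers"]

blazers_mean = blazers_salary.mean()
difference = blazers_mean - lakers_salary.mean()
print(blazers_mean)
print(difference)
```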

Creating a Table from Arrays¶

Let's start with an array of street names.

In [36]:
streets = make_array('Bancroft', 'Durant', 'Channing', 'Haste')
streets
Out[36]:
array(['Bancroft', 'Durant', 'Channing', 'Haste'],
      dtype='<U8')

We can make an empty table (no rows, no columns, no problems ...).

The Table() function makes an empty table.

In [37]:
empty_table = Table()
empty_table
Out[37]:

Exercise: Check that the empty table has 0 rows and 0 columns

In [38]:
print("Rows:", empty_table.num_rows)
print("Cols:", empty_table.num_columns)
Rows: 0
Cols: 0


Exercise: Use the table.with_column function to add a column to the table and save the new table as southside.

In [39]:
southside = empty_table.with_column("Streets", streets)
southside
Out[39]:
Streets
Bancroft
Durant
Channing
Haste


Exercise: Can you do the same thing without using empty_table?

In [40]:
southside = Table().with_column("Streets", streets)
southside
Out[40]:
Streets
Bancroft
Durant
Channing
Haste

Exercise: What is the output of:

In [41]:
empty_table.with_column("Streets", streets)
print("Number of Columns", empty_table.num_columns)
Number of Columns 0

Exercise: Extend the southside table to include the blocks from campus. (map)

In [42]:
southside = southside.with_column('Blocks from campus', np.arange(4))
southside
Out[42]:
Streets  | Blocks from campus
Bancroft | 0
Durant   | 1
Channing | 2
Haste    | 3

Exercise: Build the entire table with blocks from campus in one call to the table.with_columns function.

In [43]:
Table().with_columns(
    'Streets', streets,
    'Blocks from campus', np.arange(4)
)
Out[43]:
Streets  | Blocks from campus
Bancroft | 0
Durant   | 1
Channing | 2
Haste    | 3

Case Study: Understanding the W. E. B. Du Bois Visualization¶

Picture from Wikipedia

From Wikipedia: William Edward Burghardt Du Bois (/djuːˈbɔɪs/ dew-BOYSS;[1][2] February 23, 1868 – August 27, 1963) was an American sociologist, socialist, historian, and Pan-Africanist civil rights activist. Born in Great Barrington, Massachusetts, Du Bois grew up in a relatively tolerant and integrated community. After completing graduate work at the University of Berlin and Harvard University, where he was the first African American to earn a doctorate, he became a professor of history, sociology, and economics at Atlanta University. Du Bois was one of the founders of the National Association for the Advancement of Colored People (NAACP) in 1909.

For more context on the visualization in lecture, check out Du Bois’ Data Portraits Tell A Story About Black Life In Georgia And Beyond.

In [44]:
du_bois = Table.read_table('du_bois.csv')
du_bois
Out[44]:
CLASS         | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS
100-200       | 139.1          | 0.19 | 0.43 | 0.28    | 0.001 | 0.099 | POOR
200-300       | 249.45         | 0.22 | 0.47 | 0.23    | 0.04  | 0.04  | POOR
300-400       | 335.66         | 0.23 | 0.43 | 0.18    | 0.045 | 0.115 | FAIR
400-500       | 433.82         | 0.18 | 0.37 | 0.15    | 0.055 | 0.245 | FAIR
500-750       | 547            | 0.13 | 0.31 | 0.17    | 0.05  | 0.34  | COMFORTABLE
750-1000      | 880            | 0    | 0.37 | 0.19    | 0.08  | 0.36  | COMFORTABLE
1000 and over | 1125           | 0    | 0.29 | 0.16    | 0.045 | 0.505 | WELL-TO-DO

Exercise: Compute the amount of money spent on food and add it to the table as "FOOD $":

In [45]:
du_bois = du_bois.with_columns(
    "FOOD $", du_bois.column('ACTUAL AVERAGE') * du_bois.column('FOOD'))
du_bois
Out[45]:
CLASS         | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS      | FOOD $
100-200       | 139.1          | 0.19 | 0.43 | 0.28    | 0.001 | 0.099 | POOR        | 59.813
200-300       | 249.45         | 0.22 | 0.47 | 0.23    | 0.04  | 0.04  | POOR        | 117.241
300-400       | 335.66         | 0.23 | 0.43 | 0.18    | 0.045 | 0.115 | FAIR        | 144.334
400-500       | 433.82         | 0.18 | 0.37 | 0.15    | 0.055 | 0.245 | FAIR        | 160.513
500-750       | 547            | 0.13 | 0.31 | 0.17    | 0.05  | 0.34  | COMFORTABLE | 169.57
750-1000      | 880            | 0    | 0.37 | 0.19    | 0.08  | 0.36  | COMFORTABLE | 325.6
1000 and over | 1125           | 0    | 0.29 | 0.16    | 0.045 | 0.505 | WELL-TO-DO  | 326.25


Exercise: Use the table functions we learned this week to find the income bracket ("class") that spent the most money on rent.

In [46]:
du_bois = du_bois.with_columns("RENT $", 
    du_bois.column("RENT") * du_bois.column("ACTUAL AVERAGE"))
du_bois.sort("RENT $", descending = True)
Out[46]:
CLASS         | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS      | FOOD $  | RENT $
400-500       | 433.82         | 0.18 | 0.37 | 0.15    | 0.055 | 0.245 | FAIR        | 160.513 | 78.0876
300-400       | 335.66         | 0.23 | 0.43 | 0.18    | 0.045 | 0.115 | FAIR        | 144.334 | 77.2018
500-750       | 547            | 0.13 | 0.31 | 0.17    | 0.05  | 0.34  | COMFORTABLE | 169.57  | 71.11
200-300       | 249.45         | 0.22 | 0.47 | 0.23    | 0.04  | 0.04  | POOR        | 117.241 | 54.879
100-200       | 139.1          | 0.19 | 0.43 | 0.28    | 0.001 | 0.099 | POOR        | 59.813  | 26.429
750-1000      | 880            | 0    | 0.37 | 0.19    | 0.08  | 0.36  | COMFORTABLE | 325.6   | 0
1000 and over | 1125           | 0    | 0.29 | 0.16    | 0.045 | 0.505 | WELL-TO-DO  | 326.25  | 0
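
As a cross-check on the sorted table, the same answer can be sketched with arrays alone. The values below are copied from the du_bois table shown earlier, and np.argmax (which returns the position of the largest element) is a NumPy function not otherwise used in this lecture, so treat this as an alternative illustration and not as the intended table-based solution.

```python
import numpy as np

# Columns copied from the du_bois table.
income_class = np.array(["100-200", "200-300", "300-400", "400-500",
                         "500-750", "750-1000", "1000 and over"])
actual_average = np.array([139.1, 249.45, 335.66, 433.82, 547, 880, 1125])
rent_share = np.array([0.19, 0.22, 0.23, 0.18, 0.13, 0, 0])

# Dollars spent on rent, one value per income class.
rent_dollars = rent_share * actual_average

# Position of the largest rent amount, then the class at that position.
top_position = int(np.argmax(rent_dollars))
top_class = income_class.item(top_position)
print(top_class, rent_dollars.item(top_position))
```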