Data 8 datascience Reference

This notebook serves as an interactive, Data 8-friendly reference for the datascience library.

Table Functions and Methods

Table()

Create an empty table, usually to extend with data

[29]:
new_table = Table()
new_table
[29]:
[30]:
type(new_table)
[30]:
datascience.tables.Table

Table.read_table()

Table.read_table(filename)

Creates a table by reading the CSV file named filename (a string).

[31]:
trips = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/trip.csv')
trips
[31]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061
524499 431 10/31/2014 16:36 Townsend at 7th 65 10/31/2014 16:43 Civic Center BART (7th at Market) 72 630 Subscriber 94706
518524 389 10/28/2014 8:48 Market at Sansome 77 10/28/2014 8:54 2nd at South Park 64 458 Subscriber 94610
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 6/4/2015 5:36 Embarcadero at Sansome 60 289 Subscriber 94105

... (99990 rows omitted)

tbl.with_column

tbl = Table()
tbl.with_column(name, values)
tbl.with_columns(n1, v1, n2, v2,...)

Creates a new table by adding a column with name name and values values to another table. name should be a string and values should have as many entries as there are rows in the original table. If values is a single value, then every row of that column has the value values.

In the examples below, we start with adding a column to the existing table trips with values being an array we construct from existing tables.

[32]:
trips.with_column(
    "Difference in terminal", abs(trips.column("Start Terminal") - trips.column("End Terminal"))
)
[32]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code Difference in terminal
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518 7
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111 10
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025 2
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925 2
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117 2
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061 7
524499 431 10/31/2014 16:36 Townsend at 7th 65 10/31/2014 16:43 Civic Center BART (7th at Market) 72 630 Subscriber 94706 7
518524 389 10/28/2014 8:48 Market at Sansome 77 10/28/2014 8:54 2nd at South Park 64 458 Subscriber 94610 13
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107 0
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 6/4/2015 5:36 Embarcadero at Sansome 60 289 Subscriber 94105 6

... (99990 rows omitted)

We can also create a new table by adding two new columns with column name followed by the array values.

[33]:
cookies = Table()
cookies = cookies.with_columns(
    "Cookie", make_array("Sugar cookies", "Chocolate chip", "Red velvet", "Oatmeal raisin", "Peanut butter"),
    "Quantity", make_array(10, 15, 15, 10, 5)
)
cookies
[33]:
Cookie Quantity
Sugar cookies 10
Chocolate chip 15
Red velvet 15
Oatmeal raisin 10
Peanut butter 5
[34]:
prices = make_array(1.00, 1.50, 1.75, 1.25, 1.00)
cookies = cookies.with_column("Price ($)", prices)
cookies
[34]:
Cookie Quantity Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1

In the last examples, we add a new column Delicious with one value “yes,” and we see every column has the same value.

[35]:
cookies.with_column("Delicious", "yes")
[35]:
Cookie Quantity Price ($) Delicious
Sugar cookies 10 1 yes
Chocolate chip 15 1.5 yes
Red velvet 15 1.75 yes
Oatmeal raisin 10 1.25 yes
Peanut butter 5 1 yes

tbl.column()

tbl.column(column_name_or_index)

Outputs an array of values of the column column_name_or_index. column_name_or_index is a string of the column name or number which is the index of the column.

In the examples below, we start with an array of the Cookie column from the table cookies first by the column name then by using the index of the column.

[36]:
cookies.column("Cookie")
[36]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
       'Peanut butter'], dtype='<U14')
[37]:
cookies.column(0)
[37]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
       'Peanut butter'], dtype='<U14')

tbl.num_rows

Computes the number of rows in a table.

[38]:
trips.num_rows
[38]:
100000
[39]:
cookies.num_rows
[39]:
5

tbl.num_columns

Computes the number of columns in a table.

[40]:
trips.num_columns
[40]:
11
[41]:
cookies.num_columns
[41]:
3

tbl.labels

Outputs the column labels in a table.

[42]:
trips.labels
[42]:
('Trip ID',
 'Duration',
 'Start Date',
 'Start Station',
 'Start Terminal',
 'End Date',
 'End Station',
 'End Terminal',
 'Bike #',
 'Subscriber Type',
 'Zip Code')
[43]:
cookies.labels
[43]:
('Cookie', 'Quantity', 'Price ($)')

tbl.select()

tbl.select(col1, col2, ...)

Creates a copy of a table with only the selected columns. Each column is the column name as a string or the integer index of the column.

Suppose we want to select the Trip ID, Duration, Bike #, and Zip Code columns from the trips table.

[44]:
trips.select("Trip ID", "Duration", "Bike #", "Zip Code")
[44]:
Trip ID Duration Bike # Zip Code
876419 413 269 94518
459672 408 429 94111
903647 723 631 94025
452829 409 428 94925
491023 224 144 94117
723352 519 629 94061
524499 431 630 94706
518524 389 458 94610
710070 11460 375 94107
793149 616 289 94105

... (99990 rows omitted)

Similarly, we can use indexes to select columns. Remember to start indexing at 0.

[45]:
trips.select(0, 1, 8, 10).show(5)
Trip ID Duration Bike # Zip Code
876419 413 269 94518
459672 408 429 94111
903647 723 631 94025
452829 409 428 94925
491023 224 144 94117

... (99995 rows omitted)

tbl.drop()

tbl.drop(col1, col2, ...)

Creates a copy of a table without the specified columns. Each column is the column name as a string or integer index.

[46]:
cookies.drop("Quantity")
[46]:
Cookie Price ($)
Sugar cookies 1
Chocolate chip 1.5
Red velvet 1.75
Oatmeal raisin 1.25
Peanut butter 1
[47]:
trips.drop("End Date", "Subscriber Type")
[47]:
Trip ID Duration Start Date Start Station Start Terminal End Station End Terminal Bike # Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 Townsend at 7th 65 269 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 Embarcadero at Sansome 60 429 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 Market at 10th 67 631 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 Market at 4th 76 428 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 San Jose Diridon Caltrain Station 2 144 94117
723352 519 4/13/2015 17:04 Howard at 2nd 63 San Francisco Caltrain (Townsend at 4th) 70 629 94061
524499 431 10/31/2014 16:36 Townsend at 7th 65 Civic Center BART (7th at Market) 72 630 94706
518524 389 10/28/2014 8:48 Market at Sansome 77 2nd at South Park 64 458 94610
710070 11460 4/2/2015 18:13 Powell Street BART 39 Powell Street BART 39 375 94107
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 Embarcadero at Sansome 60 289 94105

... (99990 rows omitted)

[48]:
trips.drop(3, 6, 8, 9, 10)
[48]:
Trip ID Duration Start Date Start Terminal End Date End Terminal
876419 413 8/5/2015 8:29 72 8/5/2015 8:36 65
459672 408 9/18/2014 17:11 50 9/18/2014 17:17 60
903647 723 8/25/2015 7:26 69 8/25/2015 7:38 67
452829 409 9/15/2014 8:29 74 9/15/2014 8:36 76
491023 224 10/9/2014 16:13 4 10/9/2014 16:17 2
723352 519 4/13/2015 17:04 63 4/13/2015 17:12 70
524499 431 10/31/2014 16:36 65 10/31/2014 16:43 72
518524 389 10/28/2014 8:48 77 10/28/2014 8:54 64
710070 11460 4/2/2015 18:13 39 4/2/2015 21:24 39
793149 616 6/4/2015 5:26 54 6/4/2015 5:36 60

... (99990 rows omitted)

tbl.relabel()

tbl.relabel(old_label, new_label)

Modifies the table by changing the label of the column named old_label to new_label. old_label can be a string column name or an integer index.

[49]:
cookies
[49]:
Cookie Quantity Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[50]:
cookies.relabel("Quantity", "Amount remaining")
[50]:
Cookie Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[51]:
cookies.relabel(0, "Type")
[51]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[52]:
cookies
[52]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1

tbl.show()

tbl.show(n)

Displays the first n rows of a table. If no n is provided, displays all rows.

[53]:
trips.show(5)
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117

... (99995 rows omitted)

tbl.sort()

tbl.sort(column_name_or_index, descending=False)

Sorts the rows in the table by the values in the column column_name_or_index in ascending order by default. Set descending=True to sort in descending order. column_name_or_index can be a string column label or an integer index.

[54]:
cookies
[54]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[55]:
cookies.sort("Price ($)")
[55]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Peanut butter 5 1
Oatmeal raisin 10 1.25
Chocolate chip 15 1.5
Red velvet 15 1.75
[56]:
# sort in descending order
cookies.sort("Amount remaining", descending = True)
[56]:
Type Amount remaining Price ($)
Red velvet 15 1.75
Chocolate chip 15 1.5
Oatmeal raisin 10 1.25
Sugar cookies 10 1
Peanut butter 5 1
[57]:
# alphabetical order
cookies.sort(0)
[57]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Oatmeal raisin 10 1.25
Peanut butter 5 1
Red velvet 15 1.75
Sugar cookies 10 1

tbl.where()

tbl.where(column, predicate)

Filters the table for rows where the predicate is true. predicate should be one of the provided are.<something> functions. column can be a string column label or an integer index. A list of available predicates can be found below.

[58]:
cookies.where("Amount remaining", are.above(10))
[58]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Red velvet 15 1.75
[59]:
cookies.where(0, are.equal_to("Chocolate chip"))
[59]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
[62]:
# if predicate is a value, look for rows where the column == the value
# equivalent to cookies.where(1, are.eual_to(15))
cookies.where(1, 15)
[62]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Red velvet 15 1.75
[63]:
cookies.where("Price ($)", are.below(1.25))
[63]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Peanut butter 5 1

tbl.take()

tbl.take(row_index, ...)

Returns a copy of the table with only the specified rows included. Rows are specified by their integer index, so 0 for the first, 1 for the second, etc.

[64]:
cookies
[64]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[65]:
cookies.take(0)
[65]:
Type Amount remaining Price ($)
Sugar cookies 10 1
[66]:
cookies.take(cookies.num_rows - 1)
[66]:
Type Amount remaining Price ($)
Peanut butter 5 1
[67]:
cookies.take(0, 1, 2)
[67]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75

Table Visualizations

[68]:
actors = Table().read_table("https://github.com/data-8/textbook/raw/gh-pages/data/actors.csv")
actors
[68]:
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Tom Hanks 4340.8 44 98.7 Toy Story 3 415
Robert Downey, Jr. 3947.3 53 74.5 The Avengers 623.4
Eddie Murphy 3810.4 38 100.3 Shrek 2 441.2
Tom Cruise 3587.2 36 99.6 War of the Worlds 234.3
Johnny Depp 3368.6 45 74.9 Dead Man's Chest 423.3
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Scarlett Johansson 3341.2 37 90.3 The Avengers 623.4

... (40 rows omitted)

tbl.scatter()

tbl.scatter(x_column, y_column, fit_line=False)

Creates a scatter plot with x_column on the horizontal axis and y_column on the vertical axis. These labels can be column names as strings or integer indices. Set fit_line=True to include a line of best fit for the data. You can find more examples in the textbook.

[71]:
actors.scatter('Number of Movies', 'Total Gross')
../_images/reference-nb_datascience-reference_60_0.png
[73]:
actors.scatter(2, 3, fit_line=True)
../_images/reference-nb_datascience-reference_61_0.png

tbl.plot()

tbl.plot(x_column, y_column)

Plot a line graph with x_column on the horizontal axis and y_column on the vertical axis. Sorts the table in ascending order by values in x_column first. x_column and y_column can be column names as strings or integer indices.

[74]:
movies_by_year = Table.read_table('https://github.com/data-8/textbook/raw/gh-pages/data/movies_by_year.csv')
movies_by_year.show(3)
Year Total Gross Number of Movies #1 Movie
2015 11128.5 702 Star Wars: The Force Awakens
2014 10360.8 702 American Sniper
2013 10923.6 688 Catching Fire

... (33 rows omitted)

[75]:
movies_by_year.plot('Year', 'Number of Movies')
../_images/reference-nb_datascience-reference_64_0.png

tbl.barh()

tbl.barh(categories)
tbl.barh(categories, values)

Plots a horizontal bar chart broken down by categories as the bars. If values is unspecified, one bar for each column of the table (except categories) is plotted. categories and values can be column names as strings or integer indices.

[76]:
cookies.barh("Type")
../_images/reference-nb_datascience-reference_66_0.png
[77]:
cookies.barh("Type", "Amount remaining")
../_images/reference-nb_datascience-reference_67_0.png
[78]:
cookies.barh(0, 2)
../_images/reference-nb_datascience-reference_68_0.png

tbl.hist()

tbl.hist(column)
tbl.hist(column, bins=...)

Plot a histogram of the values in column. Defaults to 10 bins of equal width. If bins is specified, it can be a number of bins to use (e.g. bins=25 will produce a histogram with 25 bins) or an array of values to use as bins (e.g. bins=make_array(1, 3, 4) will produce 2 bins: \([1, 3)\) and \([3, 4)\)). column can be column names as strings or integer indices.

[79]:
actors.hist(3)
../_images/reference-nb_datascience-reference_70_0.png
[80]:
actors.hist("Gross")
../_images/reference-nb_datascience-reference_71_0.png

Table.interactive_plots()

Table.interactive_plots()

This function will change from static plots like the ones above to interactive plots made with plotly. If a plotting method has a plotly version, that method will be used instead.

[193]:
Table.interactive_plots()
actors.scatter("Total Gross", "Gross")

Table.static_plots()

Table.static_plots()

This function turns off plotly plots.

[194]:
Table.static_plots()
actors.scatter("Total Gross", "Gross")
../_images/reference-nb_datascience-reference_75_0.png

Advanced Table Functions

tbl.apply()

tbl.apply(function, column)
tbl.apply(function, col1, col2, ...)

Applies the function function to each element of the column column and returns the values returned as an array. If function takes more than one argument, you can specify multiple columns to use for each argument in order.

[65]:
actors.apply(np.average, "Number of Movies")
[65]:
array([41., 69., 61., 44., 53., 38., 36., 45., 58., 37., 38., 49., 60.,
       43.,  7., 31., 24., 50., 39., 79., 34., 63., 23., 34., 37., 36.,
       17., 50., 39., 42., 35., 36., 17., 46., 40., 32., 17., 29., 39.,
       26., 25., 36., 27., 43., 25., 21., 40., 35., 23., 25.])
[82]:
actors
[82]:
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Tom Hanks 4340.8 44 98.7 Toy Story 3 415
Robert Downey, Jr. 3947.3 53 74.5 The Avengers 623.4
Eddie Murphy 3810.4 38 100.3 Shrek 2 441.2
Tom Cruise 3587.2 36 99.6 War of the Worlds 234.3
Johnny Depp 3368.6 45 74.9 Dead Man's Chest 423.3
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Scarlett Johansson 3341.2 37 90.3 The Avengers 623.4

... (40 rows omitted)

The example below calculates the average gross for each movie by actor by applying a function that takes in the value of Total Gross and Number of Movies and returns their quotient.

[83]:
def average_gross(total_gross, num_movies):
    return total_gross / num_movies

actors.apply(average_gross, "Total Gross", "Number of Movies")
[83]:
array([118.82195122,  69.17101449,  73.25081967,  98.65454545,
        74.47735849, 100.27368421,  99.64444444,  74.85777778,
        57.78448276,  90.3027027 ,  86.68421053,  66.9244898 ,
        53.15666667,  73.8372093 , 451.84285714, 101.62580645,
       131.2125    ,  62.478     ,  79.67435897,  39.00379747,
        89.16764706,  46.70952381, 125.67826087,  84.86470588,
        76.40540541,  78.38888889, 165.63529412,  56.316     ,
        71.86153846,  65.12619048,  77.89428571,  75.425     ,
       157.75882353,  58.28913043,  67.0225    ,  83.15625   ,
       154.96470588,  89.83103448,  66.72564103,  99.25384615,
       102.308     ,  70.82777778,  94.26666667,  58.65348837,
       100.732     , 119.06190476,  61.5925    ,  70.36      ,
       106.86086957,  96.66      ])

tbl.group()

tbl.group(column_or_columns)
tbl.group(column_or_columns, func)

Groups a table by values in column_or_columns. If column_or_columns is an array, groups by each unique combination of elements in those columns. If func is specified, it should be a function that takes in an array of values and returns a single value. If unspecified, this defaults to the count of rows in the set.

[84]:
trips.group("Start Station")
[84]:
Start Station count
2nd at Folsom 2302
2nd at South Park 2610
2nd at Townsend 3904
5th at Howard 2190
Adobe on Almaden 165
Arena Green / SAP Center 176
Beale at Market 2377
Broadway St at Battery St 2157
California Ave Caltrain Station 127
Castro Street and El Camino Real 339

... (60 rows omitted)

[85]:
trips.group("Start Station", np.mean).select(0,2)
[85]:
Start Station Duration mean
2nd at Folsom 512.887
2nd at South Park 654.565
2nd at Townsend 755.176
5th at Howard 819.509
Adobe on Almaden 2522.5
Arena Green / SAP Center 1999.7
Beale at Market 679.602
Broadway St at Battery St 827.753
California Ave Caltrain Station 4403.29
Castro Street and El Camino Real 1221.86

... (60 rows omitted)

[86]:
trips.group("Start Station").sort("count", descending = True)
[86]:
Start Station count
San Francisco Caltrain (Townsend at 4th) 7426
San Francisco Caltrain 2 (330 Townsend) 6114
Harry Bridges Plaza (Ferry Building) 4795
Temporary Transbay Terminal (Howard at Beale) 4212
Townsend at 7th 3925
2nd at Townsend 3904
Embarcadero at Sansome 3900
Steuart at Market 3872
Market at 10th 3370
Market at Sansome 3218

... (60 rows omitted)

[87]:
trips.group(['Start Station', 'End Station'])
[87]:
Start Station End Station count
2nd at Folsom 2nd at Folsom 22
2nd at Folsom 2nd at South Park 84
2nd at Folsom 2nd at Townsend 123
2nd at Folsom 5th at Howard 28
2nd at Folsom Beale at Market 34
2nd at Folsom Broadway St at Battery St 18
2nd at Folsom Civic Center BART (7th at Market) 13
2nd at Folsom Clay at Battery 70
2nd at Folsom Commercial at Montgomery 46
2nd at Folsom Davis at Jackson 8

... (1616 rows omitted)

tbl.pivot()

tbl.pivot(col1, col2)
tbl.pivot(col1, col2, values, collect)

Creates a pivot table with values in col1 as columns and values in col2 as rows. If values is unspecified, the values in the cells default to counts. If values is specified, it should be the label of a column whose values to pass as an array to collect, which should return a single value.

[88]:
more_cones = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)

more_cones
[88]:
Flavor Color Price
strawberry pink 3.55
chocolate light brown 4.75
chocolate dark brown 5.25
strawberry pink 5.25
chocolate dark brown 5.25
bubblegum pink 4.75
[89]:
more_cones.pivot('Flavor', 'Color')
[89]:
Color bubblegum chocolate strawberry
dark brown 0 2 0
light brown 0 1 0
pink 1 0 2
[90]:
more_cones.pivot('Flavor', 'Color', values='Price', collect=sum)
[90]:
Color bubblegum chocolate strawberry
dark brown 0 10.5 0
light brown 0 4.75 0
pink 4.75 0 8.8
[91]:
more_cones.pivot(0, 1)
[91]:
Color bubblegum chocolate strawberry
dark brown 0 2 0
light brown 0 1 0
pink 1 0 2

tbl.join()

tbl1.join(col1, tbl2)
tbl1.join(col1, tbl2, col2)

Performs a join of tbl1 on tbl2 where rows are only included if the value in col1 is present in both join columns. If col2 is unspecified, it is assumed to be the same label as col1.

[92]:
cones = Table().with_columns(
    'Flavor', make_array('strawberry', 'vanilla', 'chocolate', 'strawberry', 'chocolate'),
    'Price', make_array(3.55, 4.75, 6.55, 5.25, 5.75)
)
cones
[92]:
Flavor Price
strawberry 3.55
vanilla 4.75
chocolate 6.55
strawberry 5.25
chocolate 5.75
[95]:
ratings = Table().with_columns(
    'Kind', make_array('strawberry', 'chocolate', 'vanilla', 'mint chip'),
    'Stars', make_array(2.5, 3.5, 4, 3)
)
ratings
[95]:
Kind Stars
strawberry 2.5
chocolate 3.5
vanilla 4
mint chip 3
[97]:
# Joins cones on ratings. Note that the mint chip flavor doesn't appear since it's not in cones
rated = cones.join('Flavor', ratings, 'Kind')
rated
[97]:
Flavor Price Stars
chocolate 6.55 3.5
chocolate 5.75 3.5
strawberry 3.55 2.5
strawberry 5.25 2.5
vanilla 4.75 4

tbl.sample()

tbl.sample(n, with_replacement=True)

Returns a new table with n rows that were randomly sampled from the original table. If with_replacement is true, sampling occurs with replacement. For sampling without replacement, set with_replacement=False.

[98]:
# if you rerun this cell, you should get different results since the sample is random
rated.sample(2)
[98]:
Flavor Price Stars
chocolate 6.55 3.5
chocolate 6.55 3.5

Notice how the table below has more rows for certain flavors than the original rated table. This is because we are sampling with replacement, so you get theoretically get 5 of the same flavors!

[99]:
sampled_with_replacement = rated.sample(5)
sampled_with_replacement
[99]:
Flavor Price Stars
strawberry 5.25 2.5
strawberry 3.55 2.5
strawberry 3.55 2.5
chocolate 6.55 3.5
vanilla 4.75 4
[100]:
rated.sample(3, with_replacement = False)
[100]:
Flavor Price Stars
vanilla 4.75 4
strawberry 3.55 2.5
chocolate 6.55 3.5

String Methods

str.split()

string.split(separator)

Splits the string string into a list on each occurence of the substring separator. The occurences of separator are removed from the resulting list.

For example, the code below splits the string Data 8hiishifun. on the substring hi.

[101]:
example_string = "Data 8hiishifun."
example_string.split("hi")
[101]:
['Data 8', 'is', 'fun.']
[104]:
# split on .
another_string = "the.secret.message.is.123"
another_string.split(".")
[104]:
['the', 'secret', 'message', 'is', '123']

str.join()

string.join(array)

Combines each element of array into one string with string used to connect each element.

[105]:
fun_array = make_array("high", "great", "best")
"est ".join(fun_array)
[105]:
'highest greatest best'
[106]:
# you can join elements on the empty string to just merge the elements
some_strings = make_array("some", "list", "of", "strings")
"".join(some_strings)
[106]:
'somelistofstrings'

str.replace()

string.replace(old_string, new_string)

Replaces each occurrence of old_string in string with new_string.

[107]:
berkeley_string = "I saw 5 friends, 10 squirrels, and 20 people flyering on Sproul."
berkeley_string
[107]:
'I saw 5 friends, 10 squirrels, and 20 people flyering on Sproul.'
[108]:
berkeley_string.replace("friends", "frisbees")
[108]:
'I saw 5 frisbees, 10 squirrels, and 20 people flyering on Sproul.'
[110]:
# you can chain calls to .replace() since the return value is also a string
berkeley_string.replace("friends", "frisbees").replace("flyering on Sproul", "having a picnic on the Glade")
[110]:
'I saw 5 frisbees, 10 squirrels, and 20 people having a picnic on the Glade.'

Array Functions and Methods

[111]:
example_array = make_array(1, 3, 5, 7, 9)
example_array
[111]:
array([1, 3, 5, 7, 9])

max()

max(array)

Returns the maximum value of an array.

[112]:
max(example_array)
[112]:
9

min()

min(array)

Returns the minimum value of an array.

[113]:
min(example_array)
[113]:
1

sum()

sum(array)

Returns the sum of values in an array.

[114]:
sum(example_array)
[114]:
25
[115]:
sum(make_array(1, 2, 0, -10))
[115]:
-7

abs()

abs(num)
abs(array)

Take the absolute value of number or each number in an array.

[118]:
abs(-1)
[118]:
1
[119]:
new_arr = make_array(-3, -1, 5.2, 0.25, -4.9)
abs(new_arr)
[119]:
array([3.  , 1.  , 5.2 , 0.25, 4.9 ])

round(num)

round(num)
round(num, d)
np.round(array)
np.round(array, d)

Round number or array of numbers to the nearest integer. If d is specified, rounds to d places after the decimal. Use np.round to round arrays.

[124]:
round(3.14159)
[124]:
3
[125]:
round(3.14159, 3)
[125]:
3.142
[130]:
np.round(new_arr, 1)
[130]:
array([-3. , -1. ,  5.2,  0.2, -4.9])

len()

len(array)

Returns the length of an array.

[131]:
len(new_arr)
[131]:
5

make_array()

make_array(val1, val2, ...)

Creates a new array with the values passed.

[132]:
new_array = make_array(25, 16, 9, 4, 1)
new_array
[132]:
array([25, 16,  9,  4,  1])

np.mean

np.mean(array)
np.average(array)

Returns the mean of the values in an array.

[134]:
np.mean(new_array)
[134]:
11.0
[133]:
np.average(new_array)
[133]:
11.0

np.std()

np.std(array)

Returns the standard deviation of the values in an array.

[150]:
np.std(new_array)
[150]:
8.648699324175862

np.diff()

np.diff(array)

Returns an array with the pairwise differences between elements in the input array. The output will have length len(array) - 1 and will have elements \(x_1 - x_0\), \(x_2 - x_1\), \(x_3 - x_2\), etc.

[135]:
np.diff(new_array)
[135]:
array([-9, -7, -5, -3])
[136]:
np.diff(make_array(1, 3, 5, 7))
[136]:
array([2, 2, 2])

np.sqrt()

np.sqrt(num)
np.sqrt(array)

Returns the square root of a number or an array of the square roots of each element in the input array.

[137]:
np.sqrt(4)
[137]:
2.0
[138]:
np.sqrt(new_array)
[138]:
array([5., 4., 3., 2., 1.])

np.arange()

np.arange(stop)
np.arange(start, stop)
np.arange(start, stop, step)

Returns an array of integers from start to stop incrementing by step. If start is unspecified, it is assumed be 0. If step is unspecified, it is assumed to be 1. The upper bound is exclusive, meaning that max(np.arange(10)) is 9.

[139]:
np.arange(0, 11)
[139]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
[140]:
np.arange(5)
[140]:
array([0, 1, 2, 3, 4])
[142]:
np.arange(0, 102, 2.7)
[142]:
array([ 0. ,  2.7,  5.4,  8.1, 10.8, 13.5, 16.2, 18.9, 21.6, 24.3, 27. ,
       29.7, 32.4, 35.1, 37.8, 40.5, 43.2, 45.9, 48.6, 51.3, 54. , 56.7,
       59.4, 62.1, 64.8, 67.5, 70.2, 72.9, 75.6, 78.3, 81. , 83.7, 86.4,
       89.1, 91.8, 94.5, 97.2, 99.9])

array.item()

array.item(num)

Returns the item at index num in an array (remember Python indices start at 0!).

[143]:
np.arange(0, 102, 2).item(1)
[143]:
2
[146]:
new_array.item(2)
[146]:
9
[147]:
new_array.item(len(new_array) - 1)
[147]:
1

np.random.choice

np.random.choice(array)
np.random.choice(array, n, replace=True)

Picks one or n of items from an array at random. By default, with replacement (set replace=False for without replacement).

[149]:
np.random.choice(new_array)
[149]:
25
[150]:
np.random.choice(new_array, 3)
[150]:
array([ 4,  4, 16])
[152]:
np.random.choice(np.arange(0, 102, 2), 10, replace=False)
[152]:
array([ 98,  22,  12,  56,  24,  54, 100,  52,  28,  88])

np.count_nonzero()

Returns the number of nonzero elements in an array. Because False values are considered zeros (as integers), this can also give you the number of Trues in an array of boolean valuea.

[153]:
another_array = make_array(0, 1, 2, 0, 4, 0, 1, 0, 0)
np.count_nonzero(another_array)
[153]:
4
[159]:
bools = make_array(True, False, True, False, False, True, False)
np.count_nonzero(bools)
[159]:
3

np.append()

np.append(array, item)

Returns a copy of the input array with item (must be the same type as the other entries in the array) appended to the end.

[160]:
new_array
[160]:
array([25, 16,  9,  4,  1])
[161]:
np.append(new_array, 1000)
[161]:
array([  25,   16,    9,    4,    1, 1000])

percentile()

percentile(percent, array)

Returns the value corresponding to the specified percentile of an array. percent should be in percentage form (i.e. 50 not 0.5).

[162]:
long_array = make_array(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
long_array
[162]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
[163]:
percentile(50, long_array)
[163]:
2
[164]:
percentile(90, long_array)
[164]:
3

Table.where Predicates

All of the predicates described below can be negated by preceding the name with not_. For example, we can find values not equal to a specific value using are.not_equal_to(value).

are.equal_to()

tbl.where(column, are.equal_to(value))

Filter leaves rows only where the value in column is equal to value.

[166]:
trips.where("Duration", are.equal_to(519))
[166]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061
824979 519 6/27/2015 15:02 Japantown 9 6/27/2015 15:11 San Jose City Hall 10 660 Customer nil
439946 519 9/5/2014 12:38 Yerba Buena Center of the Arts (3rd @ Howard) 68 9/5/2014 12:47 Civic Center BART (7th at Market) 72 452 Subscriber 94105
788261 519 6/1/2015 9:21 Powell at Post (Union Square) 71 6/1/2015 9:30 Steuart at Market 74 575 Subscriber 94108
560479 519 11/28/2014 14:20 South Van Ness at Market 66 11/28/2014 14:29 Powell at Post (Union Square) 71 609 Subscriber 94108
653797 519 2/23/2015 8:47 Market at 10th 67 2/23/2015 8:55 Yerba Buena Center of the Arts (3rd @ Howard) 68 546 Subscriber 94102
887134 519 8/12/2015 17:29 Civic Center BART (7th at Market) 72 8/12/2015 17:38 Mechanics Plaza (Market at Battery) 75 313 Subscriber 94103
482225 519 10/3/2014 16:41 Spear at Folsom 49 10/3/2014 16:50 Broadway St at Battery St 82 209 Subscriber 94133
681697 519 3/14/2015 10:19 Embarcadero at Sansome 60 3/14/2015 10:28 Harry Bridges Plaza (Ferry Building) 50 368 Customer 95120
912821 519 8/31/2015 17:00 Embarcadero at Folsom 51 8/31/2015 17:09 San Francisco Caltrain (Townsend at 4th) 70 354 Subscriber 94085

... (115 rows omitted)

are.above()

tbl.where(column, are.above(value))

Filter leaves rows only where the value in column is strictly greater than value.

[167]:
trips.where("Duration", are.above(1000))
[167]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107
589964 15097 1/3/2015 15:22 Embarcadero at Sansome 60 1/3/2015 19:33 Golden Gate at Polk 59 599 Customer 29225
831509 1057 7/2/2015 10:14 2nd at Folsom 62 7/2/2015 10:31 South Van Ness at Market 66 631 Subscriber 94114
442750 6084 9/8/2014 13:32 Embarcadero at Sansome 60 9/8/2014 15:14 Embarcadero at Sansome 60 368 Customer 474454
608714 19799 1/18/2015 10:07 University and Emerson 35 1/18/2015 15:37 San Francisco Caltrain (Townsend at 4th) 70 686 Customer nil
711961 1026 4/4/2015 7:07 Davis at Jackson 42 4/4/2015 7:24 Harry Bridges Plaza (Ferry Building) 50 189 Subscriber 94111
833071 2314 7/4/2015 11:13 Market at 4th 76 7/4/2015 11:52 Washington at Kearny 46 602 Customer 94806
570731 1218 12/8/2014 23:51 MLK Library 11 12/9/2014 0:12 St James Park 13 299 Customer 95033
853698 1048 7/20/2015 10:53 Broadway St at Battery St 82 7/20/2015 11:10 Embarcadero at Sansome 60 636 Customer 91436
787510 3670 5/31/2015 10:47 Mountain View City Hall 27 5/31/2015 11:48 Castro Street and El Camino Real 32 713 Customer 94041

... (11576 rows omitted)

are.above_or_equal_to()

tbl.where(column, are.above_or_equal_to(value))

Filter leaves rows only where the value in column is greater than or equal to value.

[168]:
trips.where("Duration", are.above_or_equal_to(1000))
[168]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107
589964 15097 1/3/2015 15:22 Embarcadero at Sansome 60 1/3/2015 19:33 Golden Gate at Polk 59 599 Customer 29225
831509 1057 7/2/2015 10:14 2nd at Folsom 62 7/2/2015 10:31 South Van Ness at Market 66 631 Subscriber 94114
442750 6084 9/8/2014 13:32 Embarcadero at Sansome 60 9/8/2014 15:14 Embarcadero at Sansome 60 368 Customer 474454
608714 19799 1/18/2015 10:07 University and Emerson 35 1/18/2015 15:37 San Francisco Caltrain (Townsend at 4th) 70 686 Customer nil
711961 1026 4/4/2015 7:07 Davis at Jackson 42 4/4/2015 7:24 Harry Bridges Plaza (Ferry Building) 50 189 Subscriber 94111
833071 2314 7/4/2015 11:13 Market at 4th 76 7/4/2015 11:52 Washington at Kearny 46 602 Customer 94806
570731 1218 12/8/2014 23:51 MLK Library 11 12/9/2014 0:12 St James Park 13 299 Customer 95033
853698 1048 7/20/2015 10:53 Broadway St at Battery St 82 7/20/2015 11:10 Embarcadero at Sansome 60 636 Customer 91436
787510 3670 5/31/2015 10:47 Mountain View City Hall 27 5/31/2015 11:48 Castro Street and El Camino Real 32 713 Customer 94041

... (11597 rows omitted)

are.below()

tbl.where(column, are.below(value))

Filter leaves rows only where the value in column is strictly less than value.

[170]:
trips.where("Duration", are.below(100))
[170]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
482797 65 10/4/2014 7:50 San Francisco Caltrain (Townsend at 4th) 70 10/4/2014 7:52 San Francisco Caltrain (Townsend at 4th) 70 430 Subscriber 95112
483052 81 10/4/2014 13:52 Harry Bridges Plaza (Ferry Building) 50 10/4/2014 13:53 Harry Bridges Plaza (Ferry Building) 50 306 Customer nan
569620 84 12/8/2014 10:09 Civic Center BART (7th at Market) 72 12/8/2014 10:10 Civic Center BART (7th at Market) 72 326 Subscriber 94111
502332 79 10/16/2014 17:26 Beale at Market 56 10/16/2014 17:27 Temporary Transbay Terminal (Howard at Beale) 55 613 Subscriber 94602
604012 76 1/14/2015 15:18 Davis at Jackson 42 1/14/2015 15:19 Broadway St at Battery St 82 601 Subscriber 94107
704918 70 3/30/2015 22:51 Broadway St at Battery St 82 3/30/2015 22:52 Broadway St at Battery St 82 394 Subscriber 94107
513458 83 10/24/2014 8:50 2nd at Folsom 62 10/24/2014 8:51 Howard at 2nd 63 569 Subscriber 94107
696725 94 3/25/2015 8:47 Post at Kearny 47 3/25/2015 8:49 Washington at Kearny 46 516 Subscriber 94109
829817 86 7/1/2015 9:27 Market at Sansome 77 7/1/2015 9:28 2nd at South Park 64 292 Subscriber 94538
745895 73 4/29/2015 13:05 Yerba Buena Center of the Arts (3rd @ Howard) 68 4/29/2015 13:06 Yerba Buena Center of the Arts (3rd @ Howard) 68 380 Subscriber 94947

... (403 rows omitted)

are.below_or_equal_to()

tbl.where(column, are.below_or_equal_to(value))

Filter leaves rows only where the value in column is less than or equal to value.

[171]:
trips.where("Duration", are.below_or_equal_to(100))
[171]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
482797 65 10/4/2014 7:50 San Francisco Caltrain (Townsend at 4th) 70 10/4/2014 7:52 San Francisco Caltrain (Townsend at 4th) 70 430 Subscriber 95112
483052 81 10/4/2014 13:52 Harry Bridges Plaza (Ferry Building) 50 10/4/2014 13:53 Harry Bridges Plaza (Ferry Building) 50 306 Customer nan
569620 84 12/8/2014 10:09 Civic Center BART (7th at Market) 72 12/8/2014 10:10 Civic Center BART (7th at Market) 72 326 Subscriber 94111
502332 79 10/16/2014 17:26 Beale at Market 56 10/16/2014 17:27 Temporary Transbay Terminal (Howard at Beale) 55 613 Subscriber 94602
604012 76 1/14/2015 15:18 Davis at Jackson 42 1/14/2015 15:19 Broadway St at Battery St 82 601 Subscriber 94107
704918 70 3/30/2015 22:51 Broadway St at Battery St 82 3/30/2015 22:52 Broadway St at Battery St 82 394 Subscriber 94107
513458 83 10/24/2014 8:50 2nd at Folsom 62 10/24/2014 8:51 Howard at 2nd 63 569 Subscriber 94107
696725 94 3/25/2015 8:47 Post at Kearny 47 3/25/2015 8:49 Washington at Kearny 46 516 Subscriber 94109
808199 100 6/15/2015 20:57 Post at Kearny 47 6/15/2015 20:58 2nd at South Park 64 537 Subscriber 94107
829817 86 7/1/2015 9:27 Market at Sansome 77 7/1/2015 9:28 2nd at South Park 64 292 Subscriber 94538

... (430 rows omitted)

are.between()

tbl.where(column, are.between(x, y))

Filter leaves rows only where the value in column is greather than or equal to x and less than y (i.e. in the interval \([x, y)\)).

[172]:
trips.where("Duration", are.between(100, 200))
[172]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
437830 151 9/4/2014 9:13 Grant Avenue at Columbus Avenue 73 9/4/2014 9:15 Commercial at Montgomery 45 306 Subscriber 94104
436255 195 9/3/2014 11:53 2nd at Folsom 62 9/3/2014 11:57 Spear at Folsom 49 403 Subscriber 94107
585884 151 12/26/2014 13:34 Broadway St at Battery St 82 12/26/2014 13:37 Harry Bridges Plaza (Ferry Building) 50 576 Subscriber 94107
548322 191 11/17/2014 20:10 Yerba Buena Center of the Arts (3rd @ Howard) 68 11/17/2014 20:13 Market at Sansome 77 29 Subscriber 94705
594999 185 1/7/2015 17:53 San Antonio Caltrain Station 29 1/7/2015 17:56 San Antonio Shopping Center 31 176 Subscriber 94040
468534 194 9/24/2014 19:08 Mechanics Plaza (Market at Battery) 75 9/24/2014 19:11 Harry Bridges Plaza (Ferry Building) 50 443 Subscriber 94107
873710 169 8/3/2015 17:20 Broadway St at Battery St 82 8/3/2015 17:23 Embarcadero at Sansome 60 532 Subscriber 94114
853087 168 7/20/2015 7:27 Temporary Transbay Terminal (Howard at Beale) 55 7/20/2015 7:30 2nd at Folsom 62 418 Subscriber 94602
863019 162 7/27/2015 8:31 Temporary Transbay Terminal (Howard at Beale) 55 7/27/2015 8:34 Mechanics Plaza (Market at Battery) 75 504 Subscriber 94111
883134 173 8/10/2015 15:11 Embarcadero at Folsom 51 8/10/2015 15:14 Beale at Market 56 363 Subscriber 94117

... (5083 rows omitted)

are.between_or_equal_to()

tbl.where(column, are.between_or_equal_to(x, y))

Filter leaves rows only where the value in column is between or equal to x and y (i.e. in the interval \([x, y]\)).

[173]:
trips.where("Duration", are.between_or_equal_to(100, 200))
[173]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
437830 151 9/4/2014 9:13 Grant Avenue at Columbus Avenue 73 9/4/2014 9:15 Commercial at Montgomery 45 306 Subscriber 94104
436255 195 9/3/2014 11:53 2nd at Folsom 62 9/3/2014 11:57 Spear at Folsom 49 403 Subscriber 94107
585884 151 12/26/2014 13:34 Broadway St at Battery St 82 12/26/2014 13:37 Harry Bridges Plaza (Ferry Building) 50 576 Subscriber 94107
548322 191 11/17/2014 20:10 Yerba Buena Center of the Arts (3rd @ Howard) 68 11/17/2014 20:13 Market at Sansome 77 29 Subscriber 94705
903735 200 8/25/2015 7:59 Temporary Transbay Terminal (Howard at Beale) 55 8/25/2015 8:02 Steuart at Market 74 453 Subscriber 94501
594999 185 1/7/2015 17:53 San Antonio Caltrain Station 29 1/7/2015 17:56 San Antonio Shopping Center 31 176 Subscriber 94040
468534 194 9/24/2014 19:08 Mechanics Plaza (Market at Battery) 75 9/24/2014 19:11 Harry Bridges Plaza (Ferry Building) 50 443 Subscriber 94107
873710 169 8/3/2015 17:20 Broadway St at Battery St 82 8/3/2015 17:23 Embarcadero at Sansome 60 532 Subscriber 94114
853087 168 7/20/2015 7:27 Temporary Transbay Terminal (Howard at Beale) 55 7/20/2015 7:30 2nd at Folsom 62 418 Subscriber 94602
863019 162 7/27/2015 8:31 Temporary Transbay Terminal (Howard at Beale) 55 7/27/2015 8:34 Mechanics Plaza (Market at Battery) 75 504 Subscriber 94111

... (5180 rows omitted)

are.contained_in()

tbl.where(column, are.contained_in(string_or_array))

Filter leaves rows only where the value in column is a substring of string_or_array if it is a string or an element of string_or_array if it is an array

[176]:
trips.where("Start Station", are.contained_in("2nd at Folsom San Antonio Caltrain Station"))
[176]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
436255 195 9/3/2014 11:53 2nd at Folsom 62 9/3/2014 11:57 Spear at Folsom 49 403 Subscriber 94107
831509 1057 7/2/2015 10:14 2nd at Folsom 62 7/2/2015 10:31 South Van Ness at Market 66 631 Subscriber 94114
877160 306 8/5/2015 16:33 2nd at Folsom 62 8/5/2015 16:39 Beale at Market 56 527 Subscriber 94602
768619 840 5/15/2015 11:35 2nd at Folsom 62 5/15/2015 11:49 Market at 10th 67 604 Subscriber 94903
594999 185 1/7/2015 17:53 San Antonio Caltrain Station 29 1/7/2015 17:56 San Antonio Shopping Center 31 176 Subscriber 94040
701211 252 3/27/2015 16:26 2nd at Folsom 62 3/27/2015 16:30 Spear at Folsom 49 405 Subscriber 94105
487432 561 10/7/2014 17:48 2nd at Folsom 62 10/7/2014 17:58 Commercial at Montgomery 45 342 Subscriber 94107
610970 808 1/20/2015 13:28 2nd at Folsom 62 1/20/2015 13:42 Harry Bridges Plaza (Ferry Building) 50 310 Subscriber 94025
753668 196 5/5/2015 11:48 2nd at Folsom 62 5/5/2015 11:52 Temporary Transbay Terminal (Howard at Beale) 55 533 Subscriber 94973
466551 222 9/23/2014 18:12 2nd at Folsom 62 9/23/2014 18:16 2nd at Townsend 61 620 Subscriber 94107

... (2578 rows omitted)

[178]:
trips.where("Start Terminal", are.contained_in(make_array(62, 29)))
[178]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
436255 195 9/3/2014 11:53 2nd at Folsom 62 9/3/2014 11:57 Spear at Folsom 49 403 Subscriber 94107
831509 1057 7/2/2015 10:14 2nd at Folsom 62 7/2/2015 10:31 South Van Ness at Market 66 631 Subscriber 94114
877160 306 8/5/2015 16:33 2nd at Folsom 62 8/5/2015 16:39 Beale at Market 56 527 Subscriber 94602
768619 840 5/15/2015 11:35 2nd at Folsom 62 5/15/2015 11:49 Market at 10th 67 604 Subscriber 94903
594999 185 1/7/2015 17:53 San Antonio Caltrain Station 29 1/7/2015 17:56 San Antonio Shopping Center 31 176 Subscriber 94040
701211 252 3/27/2015 16:26 2nd at Folsom 62 3/27/2015 16:30 Spear at Folsom 49 405 Subscriber 94105
487432 561 10/7/2014 17:48 2nd at Folsom 62 10/7/2014 17:58 Commercial at Montgomery 45 342 Subscriber 94107
610970 808 1/20/2015 13:28 2nd at Folsom 62 1/20/2015 13:42 Harry Bridges Plaza (Ferry Building) 50 310 Subscriber 94025
753668 196 5/5/2015 11:48 2nd at Folsom 62 5/5/2015 11:52 Temporary Transbay Terminal (Howard at Beale) 55 533 Subscriber 94973
466551 222 9/23/2014 18:12 2nd at Folsom 62 9/23/2014 18:16 2nd at Townsend 61 620 Subscriber 94107

... (2578 rows omitted)

are.containing()

tbl.where(column, are.containing(value))

Filter leaves rows only where the value in column contains the substring value.

[180]:
trips.where("End Station", are.containing("at"))
[180]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061
524499 431 10/31/2014 16:36 Townsend at 7th 65 10/31/2014 16:43 Civic Center BART (7th at Market) 72 630 Subscriber 94706
518524 389 10/28/2014 8:48 Market at Sansome 77 10/28/2014 8:54 2nd at South Park 64 458 Subscriber 94610
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 6/4/2015 5:36 Embarcadero at Sansome 60 289 Subscriber 94105
681771 895 3/14/2015 11:46 Market at 10th 67 3/14/2015 12:01 Market at 4th 76 416 Subscriber 94107

... (78805 rows omitted)

are.strictly_between()

tbl.where(column, are.strictly_between(x, y))

Filter leaves rows only where the value in column is strictly greater than x and less than y (i.e. in the interval \((x, y)\)).

[181]:
trips.where("Duration", are.strictly_between(100, 200))
[181]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
437830 151 9/4/2014 9:13 Grant Avenue at Columbus Avenue 73 9/4/2014 9:15 Commercial at Montgomery 45 306 Subscriber 94104
436255 195 9/3/2014 11:53 2nd at Folsom 62 9/3/2014 11:57 Spear at Folsom 49 403 Subscriber 94107
585884 151 12/26/2014 13:34 Broadway St at Battery St 82 12/26/2014 13:37 Harry Bridges Plaza (Ferry Building) 50 576 Subscriber 94107
548322 191 11/17/2014 20:10 Yerba Buena Center of the Arts (3rd @ Howard) 68 11/17/2014 20:13 Market at Sansome 77 29 Subscriber 94705
594999 185 1/7/2015 17:53 San Antonio Caltrain Station 29 1/7/2015 17:56 San Antonio Shopping Center 31 176 Subscriber 94040
468534 194 9/24/2014 19:08 Mechanics Plaza (Market at Battery) 75 9/24/2014 19:11 Harry Bridges Plaza (Ferry Building) 50 443 Subscriber 94107
873710 169 8/3/2015 17:20 Broadway St at Battery St 82 8/3/2015 17:23 Embarcadero at Sansome 60 532 Subscriber 94114
853087 168 7/20/2015 7:27 Temporary Transbay Terminal (Howard at Beale) 55 7/20/2015 7:30 2nd at Folsom 62 418 Subscriber 94602
863019 162 7/27/2015 8:31 Temporary Transbay Terminal (Howard at Beale) 55 7/27/2015 8:34 Mechanics Plaza (Market at Battery) 75 504 Subscriber 94111
883134 173 8/10/2015 15:11 Embarcadero at Folsom 51 8/10/2015 15:14 Beale at Market 56 363 Subscriber 94117

... (5056 rows omitted)

Miscellaneous Functions

sample_proportions()

sample_proportions(sample_size, model_proportions)

Samples sample_size objects from the distribution specified by model_proportions. sample_size should be an integer, model_proportions an array of probabilities that sum up to 1. It returns an array with the same size as model_proportions. Each item in the array corresponds to the proportion of times it was sampled out of the sample_size times.

[182]:
sample_proportions(100, [.5, .3, .2])
[182]:
array([0.32, 0.32, 0.36])

minimize()

minimize(function)

This function returns an array of values that minimize function. function should be a function that takes in a certain number of arguments and returns a number. The array returned by minimize is structured such that if each value in the array was passed into function as arguments, it would minimize the output value of function.

[190]:
def f(x, y):
    return 0.47 * x**2 + 1.23 * np.log(y)

minimize(f)
[190]:
array([ 5.17585792, -0.58835469])