Data 8 datascience Reference

This notebook serves as an interactive, Data 8-friendly reference for the datascience library.

Table Functions and Methods

Table()

Create an empty table, usually to extend with data

[29]:
new_table = Table()
new_table
[29]:
[30]:
type(new_table)
[30]:
datascience.tables.Table

Table.read_table()

Table.read_table(filename)

Creates a table by reading the CSV file named filename (a string).

[31]:
trips = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/trip.csv')
trips
[31]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061
524499 431 10/31/2014 16:36 Townsend at 7th 65 10/31/2014 16:43 Civic Center BART (7th at Market) 72 630 Subscriber 94706
518524 389 10/28/2014 8:48 Market at Sansome 77 10/28/2014 8:54 2nd at South Park 64 458 Subscriber 94610
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 6/4/2015 5:36 Embarcadero at Sansome 60 289 Subscriber 94105

... (99990 rows omitted)

tbl.with_column

tbl = Table()
tbl.with_column(name, values)
tbl.with_columns(n1, v1, n2, v2,...)

Creates a new table by adding a column with name name and values values to another table. name should be a string and values should have as many entries as there are rows in the original table. If values is a single value, then every row of that column has the value values.

In the examples below, we start with adding a column to the existing table trips with values being an array we construct from existing tables.

[32]:
trips.with_column(
    "Difference in terminal", abs(trips.column("Start Terminal") - trips.column("End Terminal"))
)
[32]:
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code Difference in terminal
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518 7
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111 10
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025 2
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925 2
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117 2
723352 519 4/13/2015 17:04 Howard at 2nd 63 4/13/2015 17:12 San Francisco Caltrain (Townsend at 4th) 70 629 Subscriber 94061 7
524499 431 10/31/2014 16:36 Townsend at 7th 65 10/31/2014 16:43 Civic Center BART (7th at Market) 72 630 Subscriber 94706 7
518524 389 10/28/2014 8:48 Market at Sansome 77 10/28/2014 8:54 2nd at South Park 64 458 Subscriber 94610 13
710070 11460 4/2/2015 18:13 Powell Street BART 39 4/2/2015 21:24 Powell Street BART 39 375 Subscriber 94107 0
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 6/4/2015 5:36 Embarcadero at Sansome 60 289 Subscriber 94105 6

... (99990 rows omitted)

We can also create a new table by adding two new columns with column name followed by the array values.

[33]:
cookies = Table()
cookies = cookies.with_columns(
    "Cookie", make_array("Sugar cookies", "Chocolate chip", "Red velvet", "Oatmeal raisin", "Peanut butter"),
    "Quantity", make_array(10, 15, 15, 10, 5)
)
cookies
[33]:
Cookie Quantity
Sugar cookies 10
Chocolate chip 15
Red velvet 15
Oatmeal raisin 10
Peanut butter 5
[34]:
prices = make_array(1.00, 1.50, 1.75, 1.25, 1.00)
cookies = cookies.with_column("Price ($)", prices)
cookies
[34]:
Cookie Quantity Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1

In the last examples, we add a new column Delicious with one value “yes,” and we see every column has the same value.

[35]:
cookies.with_column("Delicious", "yes")
[35]:
Cookie Quantity Price ($) Delicious
Sugar cookies 10 1 yes
Chocolate chip 15 1.5 yes
Red velvet 15 1.75 yes
Oatmeal raisin 10 1.25 yes
Peanut butter 5 1 yes

tbl.column()

tbl.column(column_name_or_index)

Outputs an array of values of the column column_name_or_index. column_name_or_index is a string of the column name or number which is the index of the column.

In the examples below, we start with an array of the Cookie column from the table cookies first by the column name then by using the index of the column.

[36]:
cookies.column("Cookie")
[36]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
       'Peanut butter'], dtype='<U14')
[37]:
cookies.column(0)
[37]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
       'Peanut butter'], dtype='<U14')

tbl.num_rows

Computes the number of rows in a table.

[38]:
trips.num_rows
[38]:
100000
[39]:
cookies.num_rows
[39]:
5

tbl.num_columns

Computes the number of columns in a table.

[40]:
trips.num_columns
[40]:
11
[41]:
cookies.num_columns
[41]:
3

tbl.labels

Outputs the column labels in a table.

[42]:
trips.labels
[42]:
('Trip ID',
 'Duration',
 'Start Date',
 'Start Station',
 'Start Terminal',
 'End Date',
 'End Station',
 'End Terminal',
 'Bike #',
 'Subscriber Type',
 'Zip Code')
[43]:
cookies.labels
[43]:
('Cookie', 'Quantity', 'Price ($)')

tbl.select()

tbl.select(col1, col2, ...)

Creates a copy of a table with only the selected columns. Each column is the column name as a string or the integer index of the column.

Suppose we want to select the Trip ID, Duration, Bike #, and Zip Code columns from the trips table.

[44]:
trips.select("Trip ID", "Duration", "Bike #", "Zip Code")
[44]:
Trip ID Duration Bike # Zip Code
876419 413 269 94518
459672 408 429 94111
903647 723 631 94025
452829 409 428 94925
491023 224 144 94117
723352 519 629 94061
524499 431 630 94706
518524 389 458 94610
710070 11460 375 94107
793149 616 289 94105

... (99990 rows omitted)

Similarly, we can use indexes to select columns. Remember to start indexing at 0.

[45]:
trips.select(0, 1, 8, 10).show(5)
Trip ID Duration Bike # Zip Code
876419 413 269 94518
459672 408 429 94111
903647 723 631 94025
452829 409 428 94925
491023 224 144 94117

... (99995 rows omitted)

tbl.drop()

tbl.drop(col1, col2, ...)

Creates a copy of a table without the specified columns. Each column is the column name as a string or integer index.

[46]:
cookies.drop("Quantity")
[46]:
Cookie Price ($)
Sugar cookies 1
Chocolate chip 1.5
Red velvet 1.75
Oatmeal raisin 1.25
Peanut butter 1
[47]:
trips.drop("End Date", "Subscriber Type")
[47]:
Trip ID Duration Start Date Start Station Start Terminal End Station End Terminal Bike # Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 Townsend at 7th 65 269 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 Embarcadero at Sansome 60 429 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 Market at 10th 67 631 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 Market at 4th 76 428 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 San Jose Diridon Caltrain Station 2 144 94117
723352 519 4/13/2015 17:04 Howard at 2nd 63 San Francisco Caltrain (Townsend at 4th) 70 629 94061
524499 431 10/31/2014 16:36 Townsend at 7th 65 Civic Center BART (7th at Market) 72 630 94706
518524 389 10/28/2014 8:48 Market at Sansome 77 2nd at South Park 64 458 94610
710070 11460 4/2/2015 18:13 Powell Street BART 39 Powell Street BART 39 375 94107
793149 616 6/4/2015 5:26 Embarcadero at Bryant 54 Embarcadero at Sansome 60 289 94105

... (99990 rows omitted)

[48]:
trips.drop(3, 6, 8, 9, 10)
[48]:
Trip ID Duration Start Date Start Terminal End Date End Terminal
876419 413 8/5/2015 8:29 72 8/5/2015 8:36 65
459672 408 9/18/2014 17:11 50 9/18/2014 17:17 60
903647 723 8/25/2015 7:26 69 8/25/2015 7:38 67
452829 409 9/15/2014 8:29 74 9/15/2014 8:36 76
491023 224 10/9/2014 16:13 4 10/9/2014 16:17 2
723352 519 4/13/2015 17:04 63 4/13/2015 17:12 70
524499 431 10/31/2014 16:36 65 10/31/2014 16:43 72
518524 389 10/28/2014 8:48 77 10/28/2014 8:54 64
710070 11460 4/2/2015 18:13 39 4/2/2015 21:24 39
793149 616 6/4/2015 5:26 54 6/4/2015 5:36 60

... (99990 rows omitted)

tbl.relabel()

tbl.relabel(old_label, new_label)

Modifies the table by changing the label of the column named old_label to new_label. old_label can be a string column name or an integer index.

[49]:
cookies
[49]:
Cookie Quantity Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[50]:
cookies.relabel("Quantity", "Amount remaining")
[50]:
Cookie Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[51]:
cookies.relabel(0, "Type")
[51]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[52]:
cookies
[52]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1

tbl.show()

tbl.show(n)

Displays the first n rows of a table. If no n is provided, displays all rows.

[53]:
trips.show(5)
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscriber Type Zip Code
876419 413 8/5/2015 8:29 Civic Center BART (7th at Market) 72 8/5/2015 8:36 Townsend at 7th 65 269 Subscriber 94518
459672 408 9/18/2014 17:11 Harry Bridges Plaza (Ferry Building) 50 9/18/2014 17:17 Embarcadero at Sansome 60 429 Subscriber 94111
903647 723 8/25/2015 7:26 San Francisco Caltrain 2 (330 Townsend) 69 8/25/2015 7:38 Market at 10th 67 631 Subscriber 94025
452829 409 9/15/2014 8:29 Steuart at Market 74 9/15/2014 8:36 Market at 4th 76 428 Subscriber 94925
491023 224 10/9/2014 16:13 Santa Clara at Almaden 4 10/9/2014 16:17 San Jose Diridon Caltrain Station 2 144 Subscriber 94117

... (99995 rows omitted)

tbl.sort()

tbl.sort(column_name_or_index, descending=False)

Sorts the rows in the table by the values in the column column_name_or_index in ascending order by default. Set descending=True to sort in descending order. column_name_or_index can be a string column label or an integer index.

[54]:
cookies
[54]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[55]:
cookies.sort("Price ($)")
[55]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Peanut butter 5 1
Oatmeal raisin 10 1.25
Chocolate chip 15 1.5
Red velvet 15 1.75
[56]:
# sort in descending order
cookies.sort("Amount remaining", descending = True)
[56]:
Type Amount remaining Price ($)
Red velvet 15 1.75
Chocolate chip 15 1.5
Oatmeal raisin 10 1.25
Sugar cookies 10 1
Peanut butter 5 1
[57]:
# alphabetical order
cookies.sort(0)
[57]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Oatmeal raisin 10 1.25
Peanut butter 5 1
Red velvet 15 1.75
Sugar cookies 10 1

tbl.where()

tbl.where(column, predicate)

Filters the table for rows where the predicate is true. predicate should be one of the provided are.<something> functions. column can be a string column label or an integer index. A list of available predicates can be found below.

[58]:
cookies.where("Amount remaining", are.above(10))
[58]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Red velvet 15 1.75
[59]:
cookies.where(0, are.equal_to("Chocolate chip"))
[59]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
[62]:
# if predicate is a value, look for rows where the column == the value
# equivalent to cookies.where(1, are.eual_to(15))
cookies.where(1, 15)
[62]:
Type Amount remaining Price ($)
Chocolate chip 15 1.5
Red velvet 15 1.75
[63]:
cookies.where("Price ($)", are.below(1.25))
[63]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Peanut butter 5 1

tbl.take()

tbl.take(row_index, ...)

Returns a copy of the table with only the specified rows included. Rows are specified by their integer index, so 0 for the first, 1 for the second, etc.

[64]:
cookies
[64]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75
Oatmeal raisin 10 1.25
Peanut butter 5 1
[65]:
cookies.take(0)
[65]:
Type Amount remaining Price ($)
Sugar cookies 10 1
[66]:
cookies.take(cookies.num_rows - 1)
[66]:
Type Amount remaining Price ($)
Peanut butter 5 1
[67]:
cookies.take(0, 1, 2)
[67]:
Type Amount remaining Price ($)
Sugar cookies 10 1
Chocolate chip 15 1.5
Red velvet 15 1.75

Table Visualizations

[68]:
actors = Table().read_table("https://github.com/data-8/textbook/raw/gh-pages/data/actors.csv")
actors
[68]:
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Tom Hanks 4340.8 44 98.7 Toy Story 3 415
Robert Downey, Jr. 3947.3 53 74.5 The Avengers 623.4
Eddie Murphy 3810.4 38 100.3 Shrek 2 441.2
Tom Cruise 3587.2 36 99.6 War of the Worlds 234.3
Johnny Depp 3368.6 45 74.9 Dead Man's Chest 423.3
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Scarlett Johansson 3341.2 37 90.3 The Avengers 623.4

... (40 rows omitted)

tbl.scatter()

tbl.scatter(x_column, y_column, fit_line=False)

Creates a scatter plot with x_column on the horizontal axis and y_column on the vertical axis. These labels can be column names as strings or integer indices. Set fit_line=True to include a line of best fit for the data. You can find more examples in the textbook.

[71]:
actors.scatter('Number of Movies', 'Total Gross')
../_images/reference-nb_datascience-reference_60_0.png
[73]:
actors.scatter(2, 3, fit_line=True)
../_images/reference-nb_datascience-reference_61_0.png

tbl.plot()

tbl.plot(x_column, y_column)

Plot a line graph with x_column on the horizontal axis and y_column on the vertical axis. Sorts the table in ascending order by values in x_column first. x_column and y_column can be column names as strings or integer indices.

[74]:
movies_by_year = Table.read_table('https://github.com/data-8/textbook/raw/gh-pages/data/movies_by_year.csv')
movies_by_year.show(3)
Year Total Gross Number of Movies #1 Movie
2015 11128.5 702 Star Wars: The Force Awakens
2014 10360.8 702 American Sniper
2013 10923.6 688 Catching Fire

... (33 rows omitted)

[75]:
movies_by_year.plot('Year', 'Number of Movies')
../_images/reference-nb_datascience-reference_64_0.png

tbl.barh()

tbl.barh(categories)
tbl.barh(categories, values)

Plots a horizontal bar chart broken down by categories as the bars. If values is unspecified, one bar for each column of the table (except categories) is plotted. categories and values can be column names as strings or integer indices.

[76]:
cookies.barh("Type")
../_images/reference-nb_datascience-reference_66_0.png
[77]:
cookies.barh("Type", "Amount remaining")
../_images/reference-nb_datascience-reference_67_0.png
[78]:
cookies.barh(0, 2)
../_images/reference-nb_datascience-reference_68_0.png

tbl.hist()

tbl.hist(column)
tbl.hist(column, bins=...)

Plot a histogram of the values in column. Defaults to 10 bins of equal width. If bins is specified, it can be a number of bins to use (e.g. bins=25 will produce a histogram with 25 bins) or an array of values to use as bins (e.g. bins=make_array(1, 3, 4) will produce 2 bins: \([1, 3)\) and \([3, 4)\)). column can be column names as strings or integer indices.

[79]:
actors.hist(3)
../_images/reference-nb_datascience-reference_70_0.png
[80]:
actors.hist("Gross")
../_images/reference-nb_datascience-reference_71_0.png

Table.interactive_plots()

Table.interactive_plots()

This function will change from static plots like the ones above to interactive plots made with plotly. If a plotting method has a plotly version, that method will be used instead.

[193]:
Table.interactive_plots()
actors.scatter("Total Gross", "Gross")