Data 8 datascience
Reference¶
This notebook serves as an interactive, Data 8-friendly reference for the datascience
library.
Table
Functions and Methods¶
Table()
¶
Create an empty table, usually to extend with data
[29]:
new_table = Table()
new_table
[29]:
[30]:
type(new_table)
[30]:
datascience.tables.Table
Table.read_table()
¶
Table.read_table(filename)
Creates a table by reading the CSV file named filename
(a string).
[31]:
trips = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/trip.csv')
trips
[31]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | 10/31/2014 16:43 | Civic Center BART (7th at Market) | 72 | 630 | Subscriber | 94706 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 10/28/2014 8:54 | 2nd at South Park | 64 | 458 | Subscriber | 94610 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | 6/4/2015 5:36 | Embarcadero at Sansome | 60 | 289 | Subscriber | 94105 |
... (99990 rows omitted)
tbl.with_column
¶
tbl = Table()
tbl.with_column(name, values)
tbl.with_columns(n1, v1, n2, v2,...)
Creates a new table by adding a column with name name
and values values
to another table. name
should be a string and values
should have as many entries as there are rows in the original table. If values
is a single value, then every row of that column has the value values
.
In the examples below, we start with adding a column to the existing table trips
with values
being an array we construct from existing tables.
[32]:
trips.with_column(
"Difference in terminal", abs(trips.column("Start Terminal") - trips.column("End Terminal"))
)
[32]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code | Difference in terminal |
---|---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 | 7 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 | 10 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 | 2 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 | 2 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 | 2 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 | 7 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | 10/31/2014 16:43 | Civic Center BART (7th at Market) | 72 | 630 | Subscriber | 94706 | 7 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 10/28/2014 8:54 | 2nd at South Park | 64 | 458 | Subscriber | 94610 | 13 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 | 0 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | 6/4/2015 5:36 | Embarcadero at Sansome | 60 | 289 | Subscriber | 94105 | 6 |
... (99990 rows omitted)
We can also create a new table by adding two new columns with column name followed by the array values.
[33]:
cookies = Table()
cookies = cookies.with_columns(
"Cookie", make_array("Sugar cookies", "Chocolate chip", "Red velvet", "Oatmeal raisin", "Peanut butter"),
"Quantity", make_array(10, 15, 15, 10, 5)
)
cookies
[33]:
Cookie | Quantity |
---|---|
Sugar cookies | 10 |
Chocolate chip | 15 |
Red velvet | 15 |
Oatmeal raisin | 10 |
Peanut butter | 5 |
[34]:
prices = make_array(1.00, 1.50, 1.75, 1.25, 1.00)
cookies = cookies.with_column("Price ($)", prices)
cookies
[34]:
Cookie | Quantity | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
In the last examples, we add a new column Delicious
with one value “yes,” and we see every column has the same value.
[35]:
cookies.with_column("Delicious", "yes")
[35]:
Cookie | Quantity | Price ($) | Delicious |
---|---|---|---|
Sugar cookies | 10 | 1 | yes |
Chocolate chip | 15 | 1.5 | yes |
Red velvet | 15 | 1.75 | yes |
Oatmeal raisin | 10 | 1.25 | yes |
Peanut butter | 5 | 1 | yes |
tbl.column()
¶
tbl.column(column_name_or_index)
Outputs an array of values of the column column_name_or_index
. column_name_or_index
is a string of the column name or number which is the index of the column.
In the examples below, we start with an array of the Cookie
column from the table cookies
first by the column name then by using the index of the column.
[36]:
cookies.column("Cookie")
[36]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
'Peanut butter'], dtype='<U14')
[37]:
cookies.column(0)
[37]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
'Peanut butter'], dtype='<U14')
tbl.num_rows
¶
Computes the number of rows in a table.
[38]:
trips.num_rows
[38]:
100000
[39]:
cookies.num_rows
[39]:
5
tbl.num_columns
¶
Computes the number of columns in a table.
[40]:
trips.num_columns
[40]:
11
[41]:
cookies.num_columns
[41]:
3
tbl.labels
¶
Outputs the column labels in a table.
[42]:
trips.labels
[42]:
('Trip ID',
'Duration',
'Start Date',
'Start Station',
'Start Terminal',
'End Date',
'End Station',
'End Terminal',
'Bike #',
'Subscriber Type',
'Zip Code')
[43]:
cookies.labels
[43]:
('Cookie', 'Quantity', 'Price ($)')
tbl.select()
¶
tbl.select(col1, col2, ...)
Creates a copy of a table with only the selected columns. Each column is the column name as a string or the integer index of the column.
Suppose we want to select the Trip ID
, Duration
, Bike #
, and Zip Code
columns from the trips
table.
[44]:
trips.select("Trip ID", "Duration", "Bike #", "Zip Code")
[44]:
Trip ID | Duration | Bike # | Zip Code |
---|---|---|---|
876419 | 413 | 269 | 94518 |
459672 | 408 | 429 | 94111 |
903647 | 723 | 631 | 94025 |
452829 | 409 | 428 | 94925 |
491023 | 224 | 144 | 94117 |
723352 | 519 | 629 | 94061 |
524499 | 431 | 630 | 94706 |
518524 | 389 | 458 | 94610 |
710070 | 11460 | 375 | 94107 |
793149 | 616 | 289 | 94105 |
... (99990 rows omitted)
Similarly, we can use indexes to select columns. Remember to start indexing at 0.
[45]:
trips.select(0, 1, 8, 10).show(5)
Trip ID | Duration | Bike # | Zip Code |
---|---|---|---|
876419 | 413 | 269 | 94518 |
459672 | 408 | 429 | 94111 |
903647 | 723 | 631 | 94025 |
452829 | 409 | 428 | 94925 |
491023 | 224 | 144 | 94117 |
... (99995 rows omitted)
tbl.drop()
¶
tbl.drop(col1, col2, ...)
Creates a copy of a table without the specified columns. Each column is the column name as a string or integer index.
[46]:
cookies.drop("Quantity")
[46]:
Cookie | Price ($) |
---|---|
Sugar cookies | 1 |
Chocolate chip | 1.5 |
Red velvet | 1.75 |
Oatmeal raisin | 1.25 |
Peanut butter | 1 |
[47]:
trips.drop("End Date", "Subscriber Type")
[47]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Station | End Terminal | Bike # | Zip Code |
---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | Townsend at 7th | 65 | 269 | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | Embarcadero at Sansome | 60 | 429 | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | Market at 10th | 67 | 631 | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | Market at 4th | 76 | 428 | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | San Jose Diridon Caltrain Station | 2 | 144 | 94117 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | 94061 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | Civic Center BART (7th at Market) | 72 | 630 | 94706 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 2nd at South Park | 64 | 458 | 94610 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | Powell Street BART | 39 | 375 | 94107 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | Embarcadero at Sansome | 60 | 289 | 94105 |
... (99990 rows omitted)
[48]:
trips.drop(3, 6, 8, 9, 10)
[48]:
Trip ID | Duration | Start Date | Start Terminal | End Date | End Terminal |
---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | 72 | 8/5/2015 8:36 | 65 |
459672 | 408 | 9/18/2014 17:11 | 50 | 9/18/2014 17:17 | 60 |
903647 | 723 | 8/25/2015 7:26 | 69 | 8/25/2015 7:38 | 67 |
452829 | 409 | 9/15/2014 8:29 | 74 | 9/15/2014 8:36 | 76 |
491023 | 224 | 10/9/2014 16:13 | 4 | 10/9/2014 16:17 | 2 |
723352 | 519 | 4/13/2015 17:04 | 63 | 4/13/2015 17:12 | 70 |
524499 | 431 | 10/31/2014 16:36 | 65 | 10/31/2014 16:43 | 72 |
518524 | 389 | 10/28/2014 8:48 | 77 | 10/28/2014 8:54 | 64 |
710070 | 11460 | 4/2/2015 18:13 | 39 | 4/2/2015 21:24 | 39 |
793149 | 616 | 6/4/2015 5:26 | 54 | 6/4/2015 5:36 | 60 |
... (99990 rows omitted)
tbl.relabel()
¶
tbl.relabel(old_label, new_label)
Modifies the table by changing the label of the column named old_label
to new_label
. old_label
can be a string column name or an integer index.
[49]:
cookies
[49]:
Cookie | Quantity | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[50]:
cookies.relabel("Quantity", "Amount remaining")
[50]:
Cookie | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[51]:
cookies.relabel(0, "Type")
[51]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[52]:
cookies
[52]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
tbl.show()
¶
tbl.show(n)
Displays the first n
rows of a table. If no n
is provided, displays all rows.
[53]:
trips.show(5)
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 |
... (99995 rows omitted)
tbl.sort()
¶
tbl.sort(column_name_or_index, descending=False)
Sorts the rows in the table by the values in the column column_name_or_index
in ascending order by default. Set descending=True
to sort in descending order. column_name_or_index
can be a string column label or an integer index.
[54]:
cookies
[54]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[55]:
cookies.sort("Price ($)")
[55]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
Oatmeal raisin | 10 | 1.25 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[56]:
# sort in descending order
cookies.sort("Amount remaining", descending = True)
[56]:
Type | Amount remaining | Price ($) |
---|---|---|
Red velvet | 15 | 1.75 |
Chocolate chip | 15 | 1.5 |
Oatmeal raisin | 10 | 1.25 |
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
[57]:
# alphabetical order
cookies.sort(0)
[57]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
Red velvet | 15 | 1.75 |
Sugar cookies | 10 | 1 |
tbl.where()
¶
tbl.where(column, predicate)
Filters the table for rows where the predicate
is true. predicate
should be one of the provided are.<something>
functions. column
can be a string column label or an integer index. A list of available predicates can be found below.
[58]:
cookies.where("Amount remaining", are.above(10))
[58]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[59]:
cookies.where(0, are.equal_to("Chocolate chip"))
[59]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
[62]:
# if predicate is a value, look for rows where the column == the value
# equivalent to cookies.where(1, are.eual_to(15))
cookies.where(1, 15)
[62]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[63]:
cookies.where("Price ($)", are.below(1.25))
[63]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
tbl.take()
¶
tbl.take(row_index, ...)
Returns a copy of the table with only the specified rows included. Rows are specified by their integer index, so 0
for the first, 1
for the second, etc.
[64]:
cookies
[64]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[65]:
cookies.take(0)
[65]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
[66]:
cookies.take(cookies.num_rows - 1)
[66]:
Type | Amount remaining | Price ($) |
---|---|---|
Peanut butter | 5 | 1 |
[67]:
cookies.take(0, 1, 2)
[67]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Table
Visualizations¶
[68]:
actors = Table().read_table("https://github.com/data-8/textbook/raw/gh-pages/data/actors.csv")
actors
[68]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
---|---|---|---|---|---|
Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415 |
Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |
Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3 |
Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3 |
Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9 |
Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4 |
... (40 rows omitted)
tbl.scatter()
¶
tbl.scatter(x_column, y_column, fit_line=False)
Creates a scatter plot with x_column
on the horizontal axis and y_column
on the vertical axis. These labels can be column names as strings or integer indices. Set fit_line=True
to include a line of best fit for the data. You can find more examples in the textbook.
[71]:
actors.scatter('Number of Movies', 'Total Gross')

[73]:
actors.scatter(2, 3, fit_line=True)

tbl.plot()
¶
tbl.plot(x_column, y_column)
Plot a line graph with x_column
on the horizontal axis and y_column
on the vertical axis. Sorts the table in ascending order by values in x_column
first. x_column
and y_column
can be column names as strings or integer indices.
[74]:
movies_by_year = Table.read_table('https://github.com/data-8/textbook/raw/gh-pages/data/movies_by_year.csv')
movies_by_year.show(3)
Year | Total Gross | Number of Movies | #1 Movie |
---|---|---|---|
2015 | 11128.5 | 702 | Star Wars: The Force Awakens |
2014 | 10360.8 | 702 | American Sniper |
2013 | 10923.6 | 688 | Catching Fire |
... (33 rows omitted)
[75]:
movies_by_year.plot('Year', 'Number of Movies')

tbl.barh()
¶
tbl.barh(categories)
tbl.barh(categories, values)
Plots a horizontal bar chart broken down by categories
as the bars. If values
is unspecified, one bar for each column of the table (except categories
) is plotted. categories
and values
can be column names as strings or integer indices.
[76]:
cookies.barh("Type")

[77]:
cookies.barh("Type", "Amount remaining")

[78]:
cookies.barh(0, 2)

tbl.hist()
¶
tbl.hist(column)
tbl.hist(column, bins=...)
Plot a histogram of the values in column
. Defaults to 10 bins of equal width. If bins
is specified, it can be a number of bins to use (e.g. bins=25
will produce a histogram with 25 bins) or an array of values to use as bins (e.g. bins=make_array(1, 3, 4)
will produce 2 bins: \([1, 3)\) and \([3, 4)\)). column
can be column names as strings or integer indices.
[79]:
actors.hist(3)

[80]:
actors.hist("Gross")

Table.interactive_plots()
¶
Table.interactive_plots()
This function will change from static plots like the ones above to interactive plots made with plotly. If a plotting method has a plotly version, that method will be used instead.
[193]:
Table.interactive_plots()
actors.scatter("Total Gross", "Gross")