In this lecture we focus on visualizing the distribution of data.
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True
How often does each possible value occur? There is a finite set of values, so we can visualize those counts as a bar chart.
We use the top movies data from the previous lecture.
top_movies = Table.read_table('top_movies_2017.csv')
top_movies
Title | Studio | Gross | Gross (Adjusted) | Year |
---|---|---|---|---|
Gone with the Wind | MGM | 198676459 | 1796176700 | 1939 |
Star Wars | Fox | 460998007 | 1583483200 | 1977 |
The Sound of Music | Fox | 158671368 | 1266072700 | 1965 |
E.T.: The Extra-Terrestrial | Universal | 435110554 | 1261085000 | 1982 |
Titanic | Paramount | 658672302 | 1204368000 | 1997 |
The Ten Commandments | Paramount | 65500000 | 1164590000 | 1956 |
Jaws | Universal | 260000000 | 1138620700 | 1975 |
Doctor Zhivago | MGM | 111721910 | 1103564200 | 1965 |
The Exorcist | Warner Brothers | 232906145 | 983226600 | 1973 |
Snow White and the Seven Dwarves | Disney | 184925486 | 969010000 | 1937 |
... (190 rows omitted)
Exercise: Compute how many times each studio appears in the list. (Here we use the group
function, which we cover in more detail next week; see the Data8 Reference Page.)
toy = Table().with_columns("Pets", make_array("Cat", "Dog", "Dog", "Bird", "Cat"))
toy
Pets |
---|
Cat |
Dog |
Dog |
Bird |
Cat |
toy.group("Pets")
Pets | count |
---|---|
Bird | 1 |
Cat | 2 |
Dog | 2 |
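To make the counting step concrete, here is a minimal plain-Python sketch of what grouping by a single column computes, using the same toy pets data. It uses collections.Counter from the standard library rather than the datascience implementation, so treat it as an illustration of the idea and not of how group works internally.

```python
from collections import Counter

# Same values as the "Pets" column in the toy table above.
pets = ["Cat", "Dog", "Dog", "Bird", "Cat"]

# Count how many times each distinct value occurs.
counts = Counter(pets)

# Sort by label so the rows come out in the same order as the grouped table.
rows = sorted(counts.items())
for pet, count in rows:
    print(pet, count)
```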
studio_counts = top_movies.select('Studio').group("Studio")
studio_counts
Studio | count |
---|---|
AVCO | 1 |
Buena Vista | 35 |
Columbia | 9 |
Disney | 11 |
Dreamworks | 3 |
Fox | 24 |
IFC | 1 |
Lionsgate | 3 |
MGM | 7 |
Metro | 1 |
... (13 rows omitted)
Exercise: Construct a bar chart depicting the number of movies from each studio (the "count" column).
(
    studio_counts
    .sort("count", descending=True)
    .barh("Studio", "count")
)
Exercise: Construct a bar chart containing the percentage of the movies from each studio.
count_col = studio_counts.column('count')
studio_counts = studio_counts.with_column("percent", count_col / count_col.sum() * 100 )
studio_counts
Studio | count | percent |
---|---|---|
AVCO | 1 | 0.5 |
Buena Vista | 35 | 17.5 |
Columbia | 9 | 4.5 |
Disney | 11 | 5.5 |
Dreamworks | 3 | 1.5 |
Fox | 24 | 12 |
IFC | 1 | 0.5 |
Lionsgate | 3 | 1.5 |
MGM | 7 | 3.5 |
Metro | 1 | 0.5 |
... (13 rows omitted)
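The percent column is each count divided by the total, times 100. A small NumPy sketch with the toy pet counts (not the studio data) shows the arithmetic and checks that the percentages add up to 100.

```python
import numpy as np

# Counts from the toy pets table: Bird, Cat, Dog.
toy_counts = np.array([1, 2, 2])

# Element-wise division by the total, then scale to a percentage.
toy_percent = toy_counts / toy_counts.sum() * 100
print(toy_percent)
print(toy_percent.sum())
```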
(
    studio_counts
    .sort("percent", descending=True)
    .barh("Studio", "percent")
)
The most basic tool for visualizing the distribution of numerical data is the histogram.
In this part of the demo, we are going to examine the age of the top 200 films.
top_movies.take(np.arange(5)) # just a preview
Title | Studio | Gross | Gross (Adjusted) | Year |
---|---|---|---|---|
Gone with the Wind | MGM | 198676459 | 1796176700 | 1939 |
Star Wars | Fox | 460998007 | 1583483200 | 1977 |
The Sound of Music | Fox | 158671368 | 1266072700 | 1965 |
E.T.: The Extra-Terrestrial | Universal | 435110554 | 1261085000 | 1982 |
Titanic | Paramount | 658672302 | 1204368000 | 1997 |
Exercise: Add a column containing the age of each movie to the top_movies table.
this_year = 2023
ages = this_year - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies
Title | Studio | Gross | Gross (Adjusted) | Year | Age |
---|---|---|---|---|---|
Gone with the Wind | MGM | 198676459 | 1796176700 | 1939 | 84 |
Star Wars | Fox | 460998007 | 1583483200 | 1977 | 46 |
The Sound of Music | Fox | 158671368 | 1266072700 | 1965 | 58 |
E.T.: The Extra-Terrestrial | Universal | 435110554 | 1261085000 | 1982 | 41 |
Titanic | Paramount | 658672302 | 1204368000 | 1997 | 26 |
The Ten Commandments | Paramount | 65500000 | 1164590000 | 1956 | 67 |
Jaws | Universal | 260000000 | 1138620700 | 1975 | 48 |
Doctor Zhivago | MGM | 111721910 | 1103564200 | 1965 | 58 |
The Exorcist | Warner Brothers | 232906145 | 983226600 | 1973 | 50 |
Snow White and the Seven Dwarves | Disney | 184925486 | 969010000 | 1937 | 86 |
... (190 rows omitted)
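The subtraction this_year - top_movies.column('Year') works because column returns a NumPy array, and arithmetic between a number and an array is applied to every element. A tiny sketch with the first three years from the table:

```python
import numpy as np

this_year = 2023

# Release years of the first three movies in the table above.
years = np.array([1939, 1977, 1965])

# Subtracting an array from a number subtracts each element in turn.
sample_ages = this_year - years
print(sample_ages)
```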
Exercise: Split the "Age" column into the following bins:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 102)
binned_data = top_movies.bin('Age', bins = my_bins)
binned_data
bin | Age count |
---|---|
0 | 0 |
5 | 21 |
10 | 17 |
15 | 41 |
25 | 43 |
40 | 57 |
65 | 21 |
102 | 0 |
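Reading this table: each row's count covers ages from that row's bin value up to, but not including, the next row's bin value. The last bin also includes its upper endpoint, and the final row only marks that endpoint, which is why its count is 0. The sketch below illustrates the same convention with np.histogram on a handful of made-up ages (hypothetical values, not the movie data). It assumes Table.bin follows NumPy's histogram convention, which is consistent with the output above but is not spelled out in this lecture.

```python
import numpy as np

# Hypothetical ages chosen to sit on or near bin edges (not the movie data).
sample_ages = np.array([3, 5, 12, 25, 40, 102])
edges = np.array([0, 5, 10, 15, 25, 40, 65, 102])

# np.histogram treats every bin as [left, right), except the last,
# which is [left, right] and so includes its upper endpoint.
sample_counts, _ = np.histogram(sample_ages, bins=edges)

# Pair each left edge with its count; there is one more edge than bins.
for left, count in zip(edges, sample_counts):
    print(left, count)
```

Note that the age 5 lands in the bin starting at 5, not the bin starting at 0, and the age 102 is still counted in the bin starting at 65.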
You can also use np.arange to create regular bins of a fixed size, or even just specify the number of bins.
top_movies.bin('Age', bins = np.arange(0, 126, 25))
bin | Age count |
---|---|
0 | 79 |
25 | 69 |
50 | 42 |
75 | 9 |
100 | 1 |
125 | 0 |
top_movies.bin('Age', bins = 10)
bin | Age count |
---|---|
6 | 41 |
15.6 | 40 |
25.2 | 28 |
34.8 | 24 |
44.4 | 25 |
54 | 19 |
63.6 | 13 |
73.2 | 6 |
82.8 | 3 |
92.4 | 1 |
... (1 rows omitted)
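When bins is a single number, the range of the data is divided into that many equal-width bins. Judging from the output above, the ages appear to run from 6 to 102, which would make each of the ten bins 9.6 years wide. A quick sketch that reproduces those edges with np.linspace (the minimum and maximum are inferred from the table, not recomputed from the data):

```python
import numpy as np

# Assumed from the output above: smallest age 6, largest age 102.
lowest, highest = 6, 102
num_bins = 10

# num_bins equal-width bins need num_bins + 1 edges.
bin_edges = np.linspace(lowest, highest, num_bins + 1)
bin_width = (highest - lowest) / num_bins

print(bin_width)
print(bin_edges)
```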
We can construct histograms of numerical variables by calling the tbl.hist(...) function with our bins.
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')
# Let's try equally spaced bins instead.
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')
top_movies.hist('Age', bins = 15, unit='Year')
# Let's try not specifying any bins!
top_movies.hist('Age', unit='Year')
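With unequal bins, comparing raw counts is misleading, since a wide bin collects more movies simply by being wide. As we understand it, hist by default plots percent per unit (here, percent per year), so that the area of each bar is the percentage of movies in that bin. A sketch of that calculation, using the bin edges and counts from the my_bins table earlier in this demo:

```python
import numpy as np

# Bin edges and counts copied from the my_bins table earlier in this demo.
edges = np.array([0, 5, 10, 15, 25, 40, 65, 102])
counts = np.array([0, 21, 17, 41, 43, 57, 21])

widths = np.diff(edges)                  # width of each bin, in years
percents = counts / counts.sum() * 100   # percent of movies in each bin
heights = percents / widths              # bar height: percent per year

for left, right, height in zip(edges[:-1], edges[1:], heights):
    print(f"{left} to {right}: {height:.2f} percent per year")
```

For example, the 5 to 10 bin holds 10.5% of the movies over 5 years, so its bar is 2.1 percent per year tall, while the 40 to 65 bin holds 28.5% of the movies but spreads them over 25 years, giving only 1.14 percent per year.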
We can also draw the same histogram with the interactive plotting tools:
top_movies.ihist('Age', unit='Year')