Lecture 8

In this lecture we focus on visualizing the distribution of data.

In [1]:
from datascience import *
import numpy as np
In [2]:
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True








Categorical Distribution

How often does each possible value occur? There is a finite set of values, so we can count the occurrences of each one and visualize those counts as a bar chart.

We use the top movies data from the previous lecture.

In [3]:
top_movies = Table.read_table('top_movies_2017.csv')
top_movies
Out[3]:
Title Studio Gross Gross (Adjusted) Year
Gone with the Wind MGM 198676459 1796176700 1939
Star Wars Fox 460998007 1583483200 1977
The Sound of Music Fox 158671368 1266072700 1965
E.T.: The Extra-Terrestrial Universal 435110554 1261085000 1982
Titanic Paramount 658672302 1204368000 1997
The Ten Commandments Paramount 65500000 1164590000 1956
Jaws Universal 260000000 1138620700 1975
Doctor Zhivago MGM 111721910 1103564200 1965
The Exorcist Warner Brothers 232906145 983226600 1973
Snow White and the Seven Dwarves Disney 184925486 969010000 1937

... (190 rows omitted)

Exercise: Compute how many times each studio appears in the list. (Here we use the group method, which we cover in more detail next week; see the Data8 Reference Page.)

In [4]:
toy = Table().with_columns("Pets", make_array("Cat", "Dog", "Dog", "Bird", "Cat"))
toy
Out[4]:
Pets
Cat
Dog
Dog
Bird
Cat
In [5]:
toy.group("Pets")
Out[5]:
Pets count
Bird 1
Cat 2
Dog 2
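
To make the behavior of group concrete, here is a minimal sketch in plain Python that reproduces the counts above with collections.Counter. It is an illustration only, not how the datascience library implements group; the names pets and pet_counts are hypothetical.

```python
from collections import Counter

# The same values as the "Pets" column of the toy table.
pets = ["Cat", "Dog", "Dog", "Bird", "Cat"]

# Count how often each distinct value occurs.
pet_counts = Counter(pets)

# Print the distinct values in sorted order, matching the row order shown above.
for pet in sorted(pet_counts):
    print(pet, pet_counts[pet])
```

Each distinct value becomes one row, and the count is the number of original rows that share that value.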
In [6]:
studio_counts = top_movies.select('Studio').group("Studio")
studio_counts
Out[6]:
Studio count
AVCO 1
Buena Vista 35
Columbia 9
Disney 11
Dreamworks 3
Fox 24
IFC 1
Lionsgate 3
MGM 7
Metro 1

... (13 rows omitted)







Exercise: Construct a bar chart depicting the number of movies from each studio (the "count").

In [7]:
(
    studio_counts
        .sort("count", descending=True)
        .barh("Studio", "count")
)







Exercise: Construct a bar chart containing the percentage of the movies from each studio.

In [8]:
count_col = studio_counts.column('count')
studio_counts = studio_counts.with_column("percent", count_col / count_col.sum() * 100)
studio_counts
Out[8]:
Studio count percent
AVCO 1 0.5
Buena Vista 35 17.5
Columbia 9 4.5
Disney 11 5.5
Dreamworks 3 1.5
Fox 24 12
IFC 1 0.5
Lionsgate 3 1.5
MGM 7 3.5
Metro 1 0.5

... (13 rows omitted)
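
The percent column is each count divided by the total number of movies, times 100. As a quick check in plain Python, the sketch below recomputes the ten rows shown above. It assumes the total is 200 (the 10 displayed rows of top_movies plus the 190 omitted ones); shown_counts is a hypothetical name that holds only the displayed counts.

```python
# Counts copied from the ten displayed rows of studio_counts.
shown_counts = {
    "AVCO": 1, "Buena Vista": 35, "Columbia": 9, "Disney": 11,
    "Dreamworks": 3, "Fox": 24, "IFC": 1, "Lionsgate": 3,
    "MGM": 7, "Metro": 1,
}

# Assumption: 10 displayed rows + 190 omitted rows in top_movies.
total_movies = 200

# percent = 100 * count / total
percents = {
    studio: 100 * count / total_movies
    for studio, count in shown_counts.items()
}

for studio, pct in percents.items():
    print(studio, pct)
```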

In [9]:
(
    studio_counts
        .sort("percent", descending=True)
        .barh("Studio", "percent")
)








Return to Slides







Distributions of Numerical Data

The most basic tool for visualizing the distribution of numerical data is the histogram.




In this part of the demo, we are going to examine the ages of the top 200 films.

In [10]:
top_movies.take(np.arange(5)) # just a preview
Out[10]:
Title Studio Gross Gross (Adjusted) Year
Gone with the Wind MGM 198676459 1796176700 1939
Star Wars Fox 460998007 1583483200 1977
The Sound of Music Fox 158671368 1266072700 1965
E.T.: The Extra-Terrestrial Universal 435110554 1261085000 1982
Titanic Paramount 658672302 1204368000 1997

Exercise: Add a column containing the age of each movie to the top_movies table.

In [11]:
this_year = 2023
ages = this_year - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies
Out[11]:
Title Studio Gross Gross (Adjusted) Year Age
Gone with the Wind MGM 198676459 1796176700 1939 84
Star Wars Fox 460998007 1583483200 1977 46
The Sound of Music Fox 158671368 1266072700 1965 58
E.T.: The Extra-Terrestrial Universal 435110554 1261085000 1982 41
Titanic Paramount 658672302 1204368000 1997 26
The Ten Commandments Paramount 65500000 1164590000 1956 67
Jaws Universal 260000000 1138620700 1975 48
Doctor Zhivago MGM 111721910 1103564200 1965 58
The Exorcist Warner Brothers 232906145 983226600 1973 50
Snow White and the Seven Dwarves Disney 184925486 969010000 1937 86

... (190 rows omitted)

Exercise: Split the "Age" column into the following bins:

In [12]:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 102)
In [13]:
binned_data = top_movies.bin('Age', bins = my_bins)
binned_data
Out[13]:
bin Age count
0 0
5 21
10 17
15 41
25 43
40 57
65 21
102 0
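
The binning rule can be sketched in plain Python. This sketch assumes the convention used by np.histogram: each bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. Under that convention the final row of the table above (102 with count 0) appears to mark only the right end of the last bin. The function bin_counts and the list preview_ages are hypothetical names; the ages are the ten displayed in the table above, not the full data.

```python
from bisect import bisect_right

def bin_counts(values, edges):
    """Count values per bin; bins are [left, right), last bin is [left, right]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the bins are not counted
        if v == edges[-1]:
            i = len(edges) - 2  # the right endpoint belongs to the last bin
        else:
            i = bisect_right(edges, v) - 1  # index of the bin whose left edge is <= v
        counts[i] += 1
    return counts

edges = [0, 5, 10, 15, 25, 40, 65, 102]

# Ages of the ten movies displayed in the table above.
preview_ages = [84, 46, 58, 41, 26, 67, 48, 58, 50, 86]

for left, count in zip(edges, bin_counts(preview_ages, edges)):
    print(left, count)
```

Note that a movie aged exactly 25 falls in the bin that starts at 25, not the one that ends there.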

You can also use np.arange to create equally spaced bins of a fixed width, or simply pass the number of bins you want.

In [14]:
top_movies.bin('Age', bins = np.arange(0, 126, 25))
Out[14]:
bin Age count
0 79
25 69
50 42
75 9
100 1
125 0
In [15]:
top_movies.bin('Age', bins = 10)
Out[15]:
bin Age count
6 41
15.6 40
25.2 28
34.8 24
44.4 25
54 19
63.6 13
73.2 6
82.8 3
92.4 1

... (1 rows omitted)
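
When only a number of bins is given, the edges in the output above are equally spaced: they start at 6 and step by 9.6. A minimal sketch of that arithmetic follows, assuming the edges run from the smallest age to the largest one (as np.histogram does). The minimum of 6 is read off the first displayed edge, and the maximum of 102 is inferred from the step size; equal_width_edges is a hypothetical helper.

```python
def equal_width_edges(lowest, highest, num_bins):
    """Return num_bins + 1 equally spaced bin edges from lowest to highest."""
    width = (highest - lowest) / num_bins
    return [lowest + i * width for i in range(num_bins + 1)]

# Assumption: ages range from 6 to 102, inferred from the displayed edges.
age_edges = equal_width_edges(6, 102, 10)

# Round for display to hide floating-point noise.
print([round(edge, 1) for edge in age_edges])
```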








Return to Slides







Histograms

We can construct histograms of numerical variables by calling the tbl.hist(...) method with our bins.

In [16]:
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')
In [17]:
# Let's try equally spaced bins instead.
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')
In [18]:
top_movies.hist('Age', bins = 15, unit='Year')
In [19]:
# Let's try not specifying any bins!
top_movies.hist('Age', unit='Year')
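
When bins have unequal widths, as in my_bins, the bar heights matter. Assuming the default density scale of hist, the height of each bar is the percent of movies in the bin divided by the bin width (percent per year), so the area of a bar, not its height, represents the share of movies. The sketch below works through that arithmetic in plain Python using the counts from the binned table for my_bins; age_counts and heights are hypothetical names.

```python
bin_edges = [0, 5, 10, 15, 25, 40, 65, 102]

# Counts from the binned table for my_bins (the final 102 row only marks the last edge).
age_counts = [0, 21, 17, 41, 43, 57, 21]
total = sum(age_counts)

heights = []
for left, right, count in zip(bin_edges, bin_edges[1:], age_counts):
    percent = 100 * count / total   # share of movies in this bin
    width = right - left            # bin width in years
    heights.append(percent / width) # percent per year

for left, right, height in zip(bin_edges, bin_edges[1:], heights):
    print(f"[{left}, {right}): {height:.2f} percent per year")
```

The bin from 40 to 65 holds the most movies, yet its bar is shorter than the bar for the bin from 5 to 10, because its count is spread over 25 years instead of 5.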

Using the interactive plotting tools:

In [20]:
top_movies.ihist('Age', unit='Year')