A Jupyter Notebook is a data-science environment that combines narrative text, code, and the results that code produces in a single document.
The Jupyter environment was created by faculty here at Berkeley (Fernando Pérez). The same ideas now appear in many other tools (e.g., Google Colab).
Note: In this lecture there is a lot of code. You are not expected to know any of this yet. This is just a preview of the things you will see in the next few weeks.
We can use the tools of data science to study text. For example, here we will do some basic analysis of "Adventures of Huckleberry Finn" (by Mark Twain) and "Little Women" (by Louisa May Alcott).
Often the first step in data science is getting the data. The following is a tiny program to download text from the web.
# A tiny program to download text from the web.
def read_url(url):
    from urllib.request import urlopen
    import re
    # Download the page, decode it to text, and collapse each run of whitespace into one space.
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
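To see what the whitespace substitution in read_url does without touching the network, here is a minimal sketch. The sample string is made up for illustration and only stands in for downloaded text.

```python
import re

# Hypothetical sample standing in for raw downloaded text
# (newlines, tabs, and repeated spaces included on purpose).
raw = "CHAPTER I.\n\nYOU don't know   about me\twithout you have read a book"

# Same substitution as in read_url: every run of whitespace becomes one space.
clean = re.sub(r'\s+', ' ', raw)
print(clean)
```

Flattening the text this way puts the whole book on one line, which makes later steps such as splitting on a marker string simpler.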
Here we download the books from the data8 textbook website.
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
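The split-and-slice step above can be illustrated on a tiny made-up string. This is only a sketch: the sample text is hypothetical, and the reason the Huckleberry Finn slice starts at 44 (presumably to skip earlier matches such as a table of contents) is an assumption, not something stated here.

```python
# Hypothetical miniature "book": some front matter followed by three chapters.
sample = "Title page. CHAPTER I. Opening. CHAPTER II. Middle. CHAPTER III. Ending."

# Splitting on the marker yields the front matter first, then one piece per chapter.
pieces = sample.split('CHAPTER ')

# Dropping the first piece keeps only the chapters, as in [1:] for Little Women.
chapters = pieces[1:]
print(len(pieces), len(chapters))
print(chapters)
```

Each resulting piece begins with the chapter number because the marker itself is removed by the split.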
Let's look at the text from the first chapter of Huckleberry Finn:
# write some code here
A lot of data science is about transforming data, often to produce tables that we can more easily analyze. In this class you will use the Berkeley datascience library to manipulate data.
from datascience import *
Table().with_column('Chapters', huck_finn_chapters)
Chapters |
---|
I. YOU don't know about me without you have read a book ... |
II. WE went tiptoeing along a path amongst the trees bac ... |
III. WELL, I got a good going-over in the morning from o ... |
IV. WELL, three or four months run along, and it was wel ... |
V. I had shut the door to. Then I turned around and ther ... |
VI. WELL, pretty soon the old man was up and around agai ... |
VII. "GIT up! What you 'bout?" I opened my eyes and look ... |
VIII. THE sun was up so high when I waked that I judged ... |
IX. I wanted to go and look at a place right about the m ... |
X. AFTER breakfast I wanted to talk about the dead man a ... |
... (33 rows omitted)
We will explore data by extracting summaries. For example, we might ask how often each character's name appears in each chapter. We can use snippets of code to answer these questions.
import numpy as np
np.char.count(huck_finn_chapters, 'Tom')
array([ 6, 24, 5, 0, 0, 0, 2, 2, 0, 0, 2, 3, 1, 0, 0, 0, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 4, 19, 15, 14, 18, 9, 32, 11, 11, 8, 30, 6])
np.char.count(huck_finn_chapters, 'Jim')
array([ 0, 16, 0, 8, 0, 0, 0, 22, 11, 19, 4, 20, 9, 6, 16, 28, 0, 10, 13, 18, 1, 0, 9, 5, 0, 0, 0, 1, 3, 5, 17, 0, 5, 17, 18, 23, 4, 27, 10, 13, 0, 12, 6])
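As a rough illustration of what these counts mean, np.char.count counts non-overlapping substring occurrences in each element, much like calling Python's str.count on every chapter. The sketch below uses plain str.count on made-up sentences (not text from the book) so it runs without any libraries.

```python
# Hypothetical stand-ins for chapter texts.
chapters = [
    "Tom and Jim went to the river. Tom laughed.",
    "Jim waited by the raft.",
    "Tomorrow nobody will be home.",
]

# One count per chapter, in chapter order.
tom_counts = [chapter.count('Tom') for chapter in chapters]
jim_counts = [chapter.count('Jim') for chapter in chapters]
print(tom_counts)
print(jim_counts)
```

Note that the third sentence still registers one 'Tom' because the match is on substrings, so "Tomorrow" counts. The same caveat applies to the name counts above.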
We can convert the results of our analysis into more tables.
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts
Tom | Jim | Huck |
---|---|---|
6 | 0 | 3 |
24 | 16 | 2 |
5 | 0 | 2 |
0 | 8 | 1 |
0 | 0 | 0 |
0 | 0 | 2 |
2 | 0 | 0 |
2 | 22 | 5 |
0 | 11 | 1 |
0 | 19 | 0 |
... (33 rows omitted)
Plot the cumulative counts: how many times each name appears in Chapter 1, in Chapters 1 and 2 combined, and so on.
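A cumulative count is a running total. As a small sketch, here are the first five per-chapter 'Tom' counts from the output earlier (6, 24, 5, 0, 0) accumulated with the standard library's itertools.accumulate, used here as a stand-in for the cumsum step on the table.

```python
from itertools import accumulate

# First five per-chapter counts of 'Tom', copied from the earlier output.
per_chapter = [6, 24, 5, 0, 0]

# Running total: entry k is the sum of the counts for chapters 1 through k + 1.
cumulative = list(accumulate(per_chapter))
print(cumulative)
```

A flat stretch in the running total means the name did not appear in those chapters, which is what a flat segment in the plotted line indicates.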
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks="Chapter")
plt.title('Cumulative Number of Times Name Appears');
What can we tell from this visualization? What questions does this raise?
# The chapters of Little Women
Table().with_column('Chapters', little_women_chapters)
Chapters |
---|
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ... |
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ... |
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ... |
FOUR BURDENS "Oh, dear, how hard it does seem to take up ... |
FIVE BEING NEIGHBORLY "What in the world are you going t ... |
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ... |
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ... |
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ... |
NINE MEG GOES TO VANITY FAIR "I do think it was the most ... |
TEN THE P.C. AND P.O. As spring came on, a new set of am ... |
... (37 rows omitted)
We can explore the characters in Little Women using the same kind of analysis.
# Counts of names in the chapters of Little Women
names = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
mentions = {name: np.char.count(little_women_chapters, name) for name in names}
counts = Table().with_columns([
    'Amy', mentions['Amy'],
    'Beth', mentions['Beth'],
    'Jo', mentions['Jo'],
    'Laurie', mentions['Laurie'],
    'Meg', mentions['Meg']
])
# Plot the cumulative counts
Table.static_plots()
cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)
plt.title('Cumulative Number of Times Name Appears');
We can also draw the same plot with interactive tools.
# Plot the cumulative counts
Table.interactive_plots()
cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)