In the previous assignment of downloading and parsing all of the White House Press Briefing pages, hopefully you noticed what a pain it is to pick out pieces of data from raw HTML using just grep. In this assignment, we learn how to use a parser designed for HTML, which will make it much easier to target the section of text that we want on every White House Briefing page.
In your compciv repo, create a folder named homework/wh-briefings-word-scrape.
By the end of this assignment, that folder should contain at least this single script: html-scraper.sh, which counts the top 10 words, 7 characters or more, used in all the WH Press Briefings. It may also contain data-hold/ as part of the process, but data-hold/ won't actually be committed to your Github repository.
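Just so the expected layout is clear, here's a bare-bones starting point for that script (the pipeline itself is the actual assignment, so this is only a placeholder):
#!/bin/bash
# html-scraper.sh
# TODO: parse every briefing in data-hold/ with pup, extract the text,
# then count the top 10 words that are 7 or more characters long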
In many ways, the code in this script will be similar to what you did in the previous assignment. However, if you are properly parsing the HTML, you should get a different answer than you would with just grep. In fact, if your script includes “container” as one of the top 10 7-letter-or-longer words, then you probably didn’t use the HTML parser to target the right thing.
After executing html-scraper.sh, you should get a list of the top 10 words as described above.
Email me that list of the top 10 words, in order of frequency, that are seven letters or longer, used in all of the briefings.
Use the subject line: Top 10 WH Words via Pup
A few things to make this go smoothly:
Do the Github/baby-names warmup homework first.
The data management part of this assignment isn't too difficult. It follows the same general structure as the baby-names assignment (though not exactly; for instance, I don't require you to provide a helper.sh script), with a subfolder in compciv/homework.
Install and try out the pup parsing tool as outlined in this recipe.
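If you want to make sure pup is working before throwing 1,300 briefing files at it, a quick smoke test on a throwaway snippet of HTML (made up here purely for illustration) looks something like this:
# feed pup a trivial HTML snippet and ask for the text of its <title> tag
echo '<html><head><title>hello pup</title></head></html>' | pup 'title text{}'
# if pup is installed correctly, this should print: hello pup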
By now, you should have completed the previous assignment in which you've downloaded every White House Press Briefing to date. If you don't remember where you dumped those files, and you don't relish the idea of rescraping the WH press briefings site, then just use my archive:
http://stash.compciv.org/wh/wh-press-briefings-2015-01-07.zip
You should move all of these HTML files into the data-hold subdirectory of this assignment's homework directory in your compciv repo, i.e.
# assuming you logged into corn.stanford.edu at this point
cd ~/compciv/homework/
mkdir -p ./wh-briefings-word-scrape/data-hold
cd wh-briefings-word-scrape/data-hold
curl http://stash.compciv.org/wh/wh-press-briefings-2015-01-07.zip \
-o briefings.zip
unzip briefings.zip
# you'll notice that in my zip file, all the HTML is in a subdirectory when
# it gets unzipped. So here's how to move those all into data-hold/ proper
mv wh-briefings/* .
# now let's get rid of that zip file and that now-empty subdirectory
rm briefings.zip
rmdir wh-briefings/
# now cd back into the homework assignment directory and work from there
cd ..
If you look at the data I provide, none of the files have an HTML extension. That's fine. If you are working from my data archive, then the following command will show you all the <title> tags from all the briefings I collected (you may just want to try it on one file, rather than having to wait for pup to burn through 1,300 HTML files during the exploratory phase):
cat data-hold/* | pup 'title'
To see just the title text:
cat data-hold/* | pup 'title text{}'
If you were just interested in the URLs that are on the "right-rail" of each page:
cat data-hold/* | pup '#right-rail a attr{href}'
Somewhere along the line, the White House changed its content management system, which means the HTML structure for this 2009 briefing, Press Gaggle by Robert Gibbs - 2/18/09, is different from the one for this 2014 briefing, Press Gaggle by Senior Administration Official on Director Clapper's Trip to North Korea.
How to fix this problem? You may have to pop open your web browser (I recommend Chrome) to view the source. Then test out pup CSS selectors. You might want to run a different pup CSS selector based on the type of page. Or, try to figure out how to write a CSS selector that includes multiple selections.
Note: This is not a trivial exercise, and can easily be a pain in the ass depending on how much you know about web development. In the end, it is about noticing patterns. I'll probably add extra hints to this section in the next couple of days.
The following options in the grep documentation might help:
-L, --files-without-match
Suppress normal output; instead print the name of each input
file from which no output would normally have been printed. The
scanning will stop on the first match.
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
Let's say there's some string that shows up in the raw HTML of only one of the two formats (the old kind of page or the new kind), and let's say you stored that string in a variable named litmus_test. The following two commands would help you figure out which press briefings were published in which format:
grep -L "$litmus_test" * # or *.html, depending on how you saved your files
And the inverse of that:
grep -l "$litmus_test" * # or *.html, depending on how you saved your files
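To give a rough idea of how those two file lists could be put to work, here's a sketch that runs a different pup selector on each group. Note that .old-format-selector and #new-format-selector are made-up placeholders, not real selectors from the briefing pages; you'd have to discover the actual ones by viewing the page source:
litmus_test='some-distinctive-string'   # whatever marker you settle on
# pages WITHOUT the marker get one selector...
grep -L "$litmus_test" data-hold/* | while read -r f; do
  cat "$f" | pup '#new-format-selector text{}' >> briefing-text.txt
done
# ...and pages WITH the marker get the other
grep -l "$litmus_test" data-hold/* | while read -r f; do
  cat "$f" | pup '.old-format-selector text{}' >> briefing-text.txt
done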
Assuming that the files are in data-hold/ and have no extension (such as .html), the answer is virtually the same as Step 3 in the previous grep assignment. The key here is to understand how the pup parser allows us to extract only the text of a given element. In this case, the text of every briefing could be found in <div id="content">...</div>…though you'd have a very hard time doing that with grep alone.
cat data-hold/* | pup '#content text{}' | \
grep -oE '[[:alpha:]]{7,}' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -rn | head -n 10
Results (may vary):
78090 president
15475 because
11852 question
11292 congress
10634 important
10585 security
10577 administration
10548 obviously
10456 american
9831 government
A for-loop was unnecessary here, given how pup can just fit into the pipeline. But if you wanted to practice it, it could've looked like this (iterating over the glob directly and quoting "$p" sidesteps any problems with spaces in file names):
# note: >> appends, so delete pupwords.txt before re-running this loop
for p in data-hold/*; do
  cat "$p" | pup '#content text{}' | \
    grep -oE '[[:alpha:]]{7,}' | \
    tr '[:upper:]' '[:lower:]' >> pupwords.txt
done
cat pupwords.txt | sort | uniq -c | sort -rn | head -n 10
In the homework hints, I threw people off with a red herring: I suggested using grep to verify the hypothesis that the White House briefing pages were split into two groups, pages with legacy-para and those without. That's what I meant by using a "litmus test":
# How many pages do we have total?
ls data-hold/* | wc -l
# 1343
# assuming old pages have the `legacy-para` element class,
# find how many pages do NOT have it:
litmus_test='legacy-para'
grep -L "$litmus_test" data-hold/* | wc -l
# 1120
# ...and find out how many DO have that term
grep -l "$litmus_test" data-hold/* | wc -l
# 223
# 223 + 1120 = 1343...so 'legacy-para' is a good way to divide things
# into two groups
At this point, you could grep both categories of pages and use different pup selectors. I assumed you would have to do this, but student Yuqing Pan pointed out that all pages, regardless of whether they were legacy pages, had the desired content in the <div id="content">...</div> element.
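If you want to verify that observation yourself, one crude but quick check (assuming the attribute appears literally as id="content" in the raw HTML) is to count the files that lack that string:
# list the files that do NOT contain id="content", then count them;
# if every page really does have the content div, this should print 0
grep -L 'id="content"' data-hold/* | wc -l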
In the end, I didn't grade you based on how observant you were of the White House content-management system, so the actual word count you may have gotten doesn't count against you. If you showed you could use pup to parse HTML and extract text, and then combine it with the previous solution, you got full credit.