Download all of the press-briefing listings, starting from http://www.whitehouse.gov/briefing-room/press-briefings?page=0. Then total up the number of lines in all the files.
Send me an email (dun@stanford) with the answer in the subject line:
Number of Lines in WH Briefing: XYZAB
And send that email through the command line, because why not.
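For what it's worth, one common way to do that is the mail utility (sometimes installed as mailx), assuming your machine is configured to send mail. This is just a sketch; the recipient address and the line count are placeholders, not the real values:

echo "" | mail -s "Number of Lines in WH Briefing: 12345" someone@example.com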
Read the lesson on the curl tool for downloading pages from the command-line.
Review Software Carpentry's lessons on Loops.
This assignment is the first step in a multi-step process to replicate NPR's work, "The Fleeting Obsessions of the White House Press Corps".
Before we can do the word-count analysis they've done, we need to first collect the webpages of each White House briefing. And before we can even do that, we need to get a list of every briefing.
This is an exercise focused on using for loops to make a repetitive task easy. We're not actually "scraping" data, in the usual sense. Just downloading lots of webpages for further use.
Since you'll be downloading a lot of files, you'll want to make a new directory.
The following command will make a new directory underneath your home directory (the tilde symbol is a shorthand for that) named mystuff/wh-briefings:
mkdir -p ~/mystuff/wh-briefings
cd ~/mystuff/wh-briefings
If you do your work from here, you can come back to this directory for future assignments.
The first page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=0
The next page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=1
The 11th (i.e. the page at index 10, counting from 0) is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=10
See a pattern?
So we need a way to generate a list of numbers in sequential order. Luckily, there's the seq command:
seq 0 5
Results in:
0
1
2
3
4
5
Putting that into a for construct:
for num in $(seq 0 5); do
echo "Hey this is a number $num"
done
# Output:
Hey this is a number 0
Hey this is a number 1
Hey this is a number 2
Hey this is a number 3
Hey this is a number 4
Hey this is a number 5
Again, read the lesson on the curl tool for downloading pages from the command-line.
To download three copies of example.com and save them in files 0.html, 1.html, and 2.html:
for num in $(seq 0 2); do
  curl http://example.com -o "$num.html"
done
Of course, we don't want to save three copies of the same website. So use the $num variable to correctly target the right page in each iteration of the loop.
If you want to get all of the briefings, you need to loop from 0 to whatever the final page is on the WH Briefings. As you get better at programming, you could probably write a program to automatically find this final page. For now, you should do it the old-fashioned way (i.e. entering random numbers into the browser's address bar until you reach the end).
Rather than looping through all of the possible White House pages, and then finding out much later that you didn't do the right thing, try just looping through the first three pages or so.
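For example, a trial run over just the first three listing pages might look something like this (a sketch; saving each page as 0.html, 1.html, and so on is just one possible naming scheme):

for num in $(seq 0 2); do
  curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$num" -o "$num.html"
done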
One of the tricky things about working from the command-line is that not everything is meant to be read as text, including HTML.
If you download the following page:
curl http://www.whitehouse.gov/briefing-room/press-briefings?page=100 -o 100.html
How do you know you downloaded the actual page, and not just an error page? Or something else unexpected?
This is where you go back to doing things as you've done before: open the page in your browser, pick a word that appears on the real page (the example below uses 'Nashville'), and then use grep to see if that word exists in the file you downloaded with curl.
So the following command should spit out a match:
grep 'Nashville' 100.html
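If you want to spot-check every downloaded file at once rather than one at a time, grep's -L option lists the files that don't contain a match. The phrase 'Press Briefing' below is just an assumption about text every real listing page should contain; substitute whatever word you verified in your browser:

grep -L 'Press Briefing' *.html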
The whitehouse.gov domain is pretty robust. But let's give it a couple of courtesy seconds between each visit. Use the sleep command in your for loop.
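For example, dropping a sleep 2 into the three-page test loop from above pauses two seconds between downloads:

for num in $(seq 0 2); do
  curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$num" -o "$num.html"
  sleep 2
done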
If your script worked, you should have a folder at ~/mystuff/wh-briefings with 100+ HTML files.
To answer the question in the deliverable, i.e. how many lines there are in all of the pages put together, you use the cat command to join the files together (look up the wildcard symbol you need to specify all of the files in a directory), and then pipe it into the command to count lines (look it up on Google).
The first thing you had to do was figure out how far back the White House press briefings archive goes, by manually increasing the page parameter, e.g.:
http://www.whitehouse.gov/briefing-room/press-briefings?page=50
http://www.whitehouse.gov/briefing-room/press-briefings?page=100
One tricky thing was that if you went back too far, the website would, by default, serve you what you get at page=0.
As of Jan. 7, 2015, the highest page number was 134.
Here's a verbose version of the URL-scraper, with comments:
# set the base URL of the briefings listing
base_url=http://www.whitehouse.gov/briefing-room/press-briefings
# set the last page number (as of 2015-01-07)
last_num=134
for i in $(seq 0 $last_num)
do
  # This echo command will print to screen the URL
  # that's currently being downloaded
  echo "$base_url?page=$i"
  # I'm silencing curl because the progress indicator is annoying
  curl "$base_url?page=$i" -s -o "$i.html"
done
Of course, it could be a lot more concise if you're into the whole brevity thing:
for i in $(seq 0 134); do
curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$i" \
-s -o "$i.html"
done
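And once every page is downloaded, the deliverable's line count is just the cat-and-count pipeline described earlier; wc -l is the standard line-counting command, and the asterisk wildcard matches every HTML file in the directory:

cat *.html | wc -l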