Week 3

Relevant guides

Class notes

In our initial work with the Bash command-line interface, we've stuck very closely to the Unix philosophy that text is a universal interface for programs.

And that philosophy should stay in the forefront of your mind going forward, as it will be useful in virtually any data analysis, processing, or visualization task you'll ever do.

So as we now learn "new" data formats, such as JSON or HTML or even just normal human language, the "new" aspect of them is that they (in theory) follow a predefined format. But they are still text. And the parsers we use (pup, jq, BeautifulSoup, etc.) follow the rules of that format and provide a convenience layer for us, the users, to extract the data. But it is all just text.
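
To make this concrete, here's a minimal sketch (the JSON snippet is made up for illustration): the same file can be searched as raw text with grep, or parsed according to JSON's rules with jq.

# save a tiny JSON snippet to a file
echo '{"name": "Band of Horses", "popularity": 59}' > tiny.json

# treat it as plain text: grep finds a matching line, but knows nothing about JSON
grep 'name' tiny.json

# parse it as JSON: jq follows the format's rules and extracts just the value
jq '.name' tiny.json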

Data formats and APIs

Next week, we'll be focusing on APIs (Application Programming Interfaces) as a source for data. For online services, APIs are also used as a way to produce data, such as sending a tweet. Think of an API as a specification for how a service lets you access its data. For example, if you want to get information about an artist on Spotify, you use its endpoint for artists:

    https://api.spotify.com/v1/artists/{id}

And you read the documentation about what's required to access that endpoint, in this case, the Spotify ID of the artist you want. Here's a sample curl call:

    curl "https://api.spotify.com/v1/artists/0OdUWJ0sBjDrqHygGUXeCF"

However, this week, we'll start off by looking at data samples. Here's the returned result of that Spotify artist call:

{
  "external_urls" : {
    "spotify" : "https://open.spotify.com/artist/0OdUWJ0sBjDrqHygGUXeCF"
  },
  "followers" : {
    "href" : null,
    "total" : 306565
  },
  "genres" : [ "indie folk", "indie pop" ],
  "href" : "https://api.spotify.com/v1/artists/0OdUWJ0sBjDrqHygGUXeCF",
  "id" : "0OdUWJ0sBjDrqHygGUXeCF",
  "images" : [ {
    "height" : 816,
    "url" : "https://i.scdn.co/image/eb266625dab075341e8c4378a177a27370f91903",
    "width" : 1000
  }, {
    "height" : 522,
    "url" : "https://i.scdn.co/image/2f91c3cace3c5a6a48f3d0e2fd21364d4911b332",
    "width" : 640
  }, {
    "height" : 163,
    "url" : "https://i.scdn.co/image/2efc93d7ee88435116093274980f04ebceb7b527",
    "width" : 200
  }, {
    "height" : 52,
    "url" : "https://i.scdn.co/image/4f25297750dfa4051195c36809a9049f6b841a23",
    "width" : 64
  } ],
  "name" : "Band of Horses",
  "popularity" : 59,
  "type" : "artist",
  "uri" : "spotify:artist:0OdUWJ0sBjDrqHygGUXeCF"
}
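
You can practice on that sample without touching the live API. A quick sketch, assuming you've copied the JSON above into a file named band-of-horses.json (a filename made up for this example):

# extract simple top-level fields
jq '.name' band-of-horses.json
jq '.followers.total' band-of-horses.json

# list every image URL as plain text, rather than as quoted JSON strings
jq --raw-output '.images[].url' band-of-horses.json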

Working with APIs requires a different set of steps. But working with their data, once it's downloaded, is the same whether the data was freshly retrieved or copied from an example, like the one above. We want to be comfortable with the data formats, so that when we need to write the programs that retrieve fresh data from the APIs, we know what we're getting.

This is no different than downloading a webpage into a local file, and running commands on just that local file. And again, it's all just text.
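
For example, here's that workflow in miniature, using example.com as a stand-in for any webpage:

# download the webpage once, into a local file
curl -s https://example.com > page.html

# from here on, every command runs against the local file, not the Internet
grep '<title>' page.html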

Most APIs today use the JSON format, which we'll be parsing from the command-line with jq.

The big picture: JSON is just another structure that data comes in, though it's much easier to deal with than scraped HTML, and a more natural fit than CSV/spreadsheets for describing complex relationships.

Check out Twitter's API documentation for user profiles and user timelines.

A lot of our focus is going to shift from programming fundamentals to understanding (and researching) data sources and thinking of ways to use them. Expect to get more comfortable with the curl tool for not just getting data, but posting it, too.
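
As a preview of the difference, here's a minimal sketch of getting vs. posting with curl. It uses httpbin.org, a public testing service (not part of our course material) that simply echoes back whatever you send:

# GET: retrieve data, as we've been doing
curl -s https://httpbin.org/get

# POST: send data to the service; the -d option supplies the data to post
curl -s -d 'message=hello' https://httpbin.org/post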

A few other interesting APIs

JSON samples

Using jq

Use the jq JSON command-line parser to try things out. Check out its tutorial first. Move on to the manual if you're feeling exploratory. jq even has an interactive playground on the Web.

A few examples using the stashed Spotify sample JSONs:

# stash the Beyonce json in a local file:
curl -s http://stash.compciv.org/samples/spotify/spotify-beyonce.json \
  > tmp-beyonce.json

# view the JSON as parsed and colorized by jq
cat tmp-beyonce.json | jq '.'

# Get the number of Beyonce's followers on Spotify:
cat tmp-beyonce.json | jq '.followers.total'

# stash the Beyonce I AM SASHA FIERCE album json in a local file:
curl -s http://stash.compciv.org/samples/spotify/spotify-beyonce-sasha-fierce.json > tmp-sasha.json

# get the names of tracks:
cat tmp-sasha.json | jq '.tracks[].name'

# get the names of tracks, followed by their popularity:
cat tmp-sasha.json | jq '.tracks[] | .name, .popularity'

# get the list of MP3 sample previews, in plaintext format:
cat tmp-sasha.json | jq --raw-output '.tracks[] .preview_url'

# (If on Mac OS X)
# download the first MP3 preview and play it
curl -o sample.mp3 \
  "$(cat tmp-sasha.json | jq --raw-output '.tracks[] .preview_url' | head -n 1)"
open sample.mp3
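
And because jq's output is just text, it composes with the rest of the Unix toolbox. One more sketch, again assuming the tmp-sasha.json file from above; jq's string interpolation syntax, \(...), builds one tab-separated line per track for sort and head to operate on:

# list tracks as "popularity <tab> name", then show the 3 most popular
jq --raw-output '.tracks[] | "\(.popularity)\t\(.name)"' tmp-sasha.json \
  | sort -rn \
  | head -n 3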

Regarding homework

Debugging

I appreciate that some of you think I'm a Unix-overlord, but in general, if your program isn't working, I'm going to need some context. Such as: what is the error message that you're getting? And: what did you do leading up to that? Or, if you aren't getting an error, but just an unexpected result: what did you get and why do you think that was unexpected?

Being able to answer those questions is a fundamental skill in understanding not just computers…but the world in general. There's not just one reason why something doesn't work. At least with computers, we have many tools to explore those reasons in very quick ways, so let's use that to our advantage.
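
One of those tools: the shell lets you save a command's error messages to a file, so you can read them carefully (or paste them into a help request) instead of letting them scroll by. A minimal sketch, using a deliberately broken command:

# "2>" redirects standard error, which is where error messages are written
cat no-such-file.txt 2> error.txt

# the error message is now just plain old text
cat error.txt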

In fact, before you send me a help request, use this chance to practice your command-line skills:

# Start the message. You can use nano.
# Or, to stay at the commandline, use a HEREDOC:
# http://tldp.org/LDP/abs/html/here-docs.html
cat > helpme.txt <<'EOF'
Hey Dan, I can't figure out what's wrong with what I'm doing. I did
such-and-such command and it isn't returning this-and-that, as I expected.
Should I be doing this, or this other thing, etc?
EOF

# send the last 50 commands you've done:
history | tail -n 50 >> helpme.txt
# the "env" command contains your system's settings, though you may 
# want to make sure you don't have anything private in there that you
# don't want to share over email
env >> helpme.txt

cat helpme.txt | mail -s 'Having problems with blah-blah' dun@stanford.edu

Breaking up the tasks

A scene from The Wire that came to mind when troubleshooting students' work on the multi-part WH briefings scrape/grep assignment:

To paraphrase the scene:

The concept of separation of concerns, a "design principle for separating a computer program into distinct sections, such that each section addresses a separate concern," is not really a technical concept as much as a human one, and one that consumes a lot of brain power in professional software engineering. And it isn't just a software concept, either: when dealing with data or information collection, you should almost always assume that you're dealing with the product of divided entities, not a unified authority. That should make you feel a little better as you spend hours cleaning and reconciling real-world data.

For the purposes of this class, separation of concerns is just the most logical way to deal with increasingly complicated tasks, and it's essential if you're relatively new to programming. I'll try to describe the assignments in what I think are the most logical "sections", though you can do what you want. I'll post a detailed answer later for the WH briefings scrape/grep assignment. But many problems arose from not realizing how or when to separate the concerns between Step 1 (grepping/extracting URLs from the WH briefings list) and Step 2 (using those URLs to download more pages from whitehouse.gov).

One example: why should Step 1 care about producing absolute URLs when it's Step 2's job to actually download from those URLs? In fact, why does Step 1 have to know anything about the Internet? All it needs to do is grep a text file that happens to contain HTML; that file is not the "Internet". If you know how to use a for-loop in Step 2, then the problem of making absolute URLs becomes much easier, as the sketch below shows.
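
Here's a minimal sketch of that division of labor. It assumes Step 1 has already grepped the relative URL paths into a file named briefing-urls.txt (a made-up filename for this example); Step 2 then builds each absolute URL and downloads it in a for-loop:

# Step 2: for each relative path from Step 1, build the absolute URL and download it
for path in $(cat briefing-urls.txt); do
  curl -s -O "https://www.whitehouse.gov$path"
done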