HTML parsing with pup

Using the pup tool to more sanely extract data from HTML files

About HTML parsing

While grep and regular expressions are a powerful way to search raw text, when text files already have structure (such as comma-delimited files or raw HTML), we want to take advantage of programs specifically designed to exploit that structure. With HTML especially, finding a pattern regular enough (never mind simple enough) for a regex to exploit is madness.
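
A minimal sketch of the problem, using a made-up two-link file called regex-demo.html: a line-by-line regex finds the link that sits on one line but misses the one whose tag is wrapped across several lines, even though a browser treats both as ordinary links.

```shell
# Write a tiny, hypothetical HTML file with two links:
# one on a single line, one wrapped across several lines.
cat > regex-demo.html <<'EOF'
<p><a href="/one.html">First story</a></p>
<p><a
  class="wrapped"
  href="/two.html">Second
  story</a></p>
EOF

# grep matches one line at a time, so this pattern only ever
# sees the first link.
grep -o '<a href="[^"]*">[^<]*</a>' regex-demo.html

# Count the lines the regex matched (the file really has two links).
regex_count=$(grep -c '<a href="[^"]*">[^<]*</a>' regex-demo.html)
echo "links found by regex: $regex_count"
```

You could keep patching the pattern to cope with line breaks, extra attributes, and attribute order, but each patch makes it harder to read and still leaves cases uncovered.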

This is why we're using pup, which works from the command line. Most other HTML-parsing libraries (such as BeautifulSoup for Python) work in much the same way as pup.

I recommend just trying out pup, as described in the rest of this guide. The arguments it takes are CSS Selectors, which you may be familiar with if you've ever used jQuery.

However, you don't have to know CSS (i.e. how to style webpages) to do HTML parsing. You just have to understand how CSS Selectors are used to target specific HTML elements. Instead of styling these HTML elements, we will be grabbing the text inside them. Different purpose, but same process and syntax of selection.

About pup

The pup program is a command-line tool that, when given HTML (e.g. piped in from a file or through curl), can filter it using CSS selectors. As its homepage says, "pup aims to be a fast and flexible way of exploring HTML from the terminal."

For corn.stanford.edu, I've written a separate tutorial on installing pup and other command-line tools that aren't part of corn.stanford.edu's standard offering. Check out that tutorial and then come back here to play with pup.

Play with pup

Let's try it out on the New York Times homepage:

# this fetches the nytimes homepage
curl -L http://www.nytimes.com
# Try it again, but pipe it into pup and select for headlines
curl -s -L http://www.nytimes.com | pup 'h2.story-heading a text{}'
# Let's get the URLs for those headlines
# We want to extract the 'href' attribute:
curl -s -L http://www.nytimes.com | pup 'h2.story-heading a attr{href}'

OK, that was easy (and error-free, I hope). Let's get a more detailed look.

A quick primer of HTML parsing

(under construction)

For the vast majority of HTML parsing that you'll ever need to do, you'll need to know these things:

  1. How to select an HTML element by tag.
  2. How to select an HTML element by id or by class.
  3. How to select a child HTML element.
  4. How to select an attribute of an HTML element.
  5. How to select the text of an HTML element.
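
As a preview, the five items above map onto pup selectors roughly as follows. This is only a sketch: primer.html and its tags, ids, and classes are made up for illustration, and the commands assume pup is installed and on your PATH.

```shell
# If pup is missing, define a stand-in so the script still runs to the end.
command -v pup >/dev/null 2>&1 || pup() { echo "pup not installed; skipped: pup $*" >&2; }

# A hypothetical page with two stories.
cat > primer.html <<'EOF'
<div id="lead" class="story">
  <h2 class="title"><a href="/a.html">Alpha</a></h2>
</div>
<div class="story">
  <h2 class="title"><a href="/b.html">Beta</a></h2>
</div>
EOF

pup 'h2' < primer.html            # 1. select by tag
pup '#lead' < primer.html         # 2. select by id (hash sign)
pup '.story' < primer.html        # 2. select by class (dot)
pup 'div h2 a' < primer.html      # 3. select a nested element
pup 'a attr{href}' < primer.html  # 4. select an attribute value
pup 'a text{}' < primer.html      # 5. select the text
```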

For the rest of the examples, check out this sample webpage I've created:

http://www.compciv.org/files/pages/nyt-sample/

To make the examples a little more readable, you can download the page and cache it as a local file (so that you don't have to keep re-downloading it):

curl -s http://www.compciv.org/files/pages/nyt-sample/ -o nyt-sample.html

Selecting elements by tag

The HTML snippet below represents the image on the sample page:

<img src="/files/pages/nyt-displays.jpg">

The tag of this element is img.

To select the img element with pup:

cat nyt-sample.html | pup 'img'

This returns:

<img src="/files/pages/nyt-displays.jpg">
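
Because pup reads HTML from standard input, cat is not the only way to hand it a file. A small sketch of two alternatives, plus a flag for counting matches: the -f and -n flags are taken from pup's usage text, so run pup --help to confirm them on your version. The file tag-demo.html is a made-up stand-in for the sample page.

```shell
# If pup is missing, define a stand-in so the script still runs to the end.
command -v pup >/dev/null 2>&1 || pup() { echo "pup not installed; skipped: pup $*" >&2; }

# A one-element stand-in for the downloaded sample page.
printf '%s\n' '<img src="/files/pages/nyt-displays.jpg">' > tag-demo.html

# Same idea as `cat tag-demo.html | pup 'img'`, using shell redirection.
pup 'img' < tag-demo.html

# Let pup open the file itself (-f / --file).
pup -f tag-demo.html 'img'

# Print how many elements matched instead of the elements (-n / --number).
pup -n 'img' < tag-demo.html
```

The cat form used in this guide is still handy when you are editing a long pipeline, since the input stays at the far left.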

Selecting elements by id or class

In HTML, there are relatively few kinds of tags. To differentiate between elements with the same tag, elements are given different ids or classes.

For example, try selecting all the h1 tags:

cat nyt-sample.html | pup 'h1'

You'll see output that includes this:

<h1 id="main-title">
 Stories from the New York Times
</h1>
<h1 class="headline">
 <a href="http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html">
  Honda Hit With Record Fine for Not Reporting Deaths
 </a>
</h1>
...

Selecting elements by id attribute

To select the first h1 element (the one with the text "Stories from the New York Times"), we can target it exclusively by its id attribute:

cat nyt-sample.html | pup 'h1#main-title'

In this case, since it happens to be the only element on the page with an id of main-title, this selector would work just as well:

cat nyt-sample.html | pup '#main-title'

Selecting elements by class attribute

The other h1-tagged elements all have a class of headline. In a selector, the dot is used to select by class:

cat nyt-sample.html | pup 'h1.headline'

Selecting child elements

Given this HTML snippet:

<article>
  <h1 class="headline">
    <a href="http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html">Youngest U.S. Grandmaster, 14, Weighs His Next Move</a>
  </h1>
  <p class="description">
    After becoming a grandmaster at the tender age of 13, Sam Sevian is getting some help from the chess champion Garry Kasparov.
  </p>
</article>

The p element can be thought of as the child of the article element. To target that p element:

cat nyt-sample.html | pup 'article p'

You can also see that the a element is a child of the h1 element, which is itself a child of the article. Here's the most specific way to target that a element:

cat nyt-sample.html | pup 'article h1 a'
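
Strictly speaking, a space between two selectors matches any descendant, however deeply nested, not only a direct child. The sketch below contrasts that with the > combinator, which pup's README lists among its implemented selectors. The file nesting-demo.html and its contents are made up for illustration, and the commands assume pup is installed.

```shell
# If pup is missing, define a stand-in so the script still runs to the end.
command -v pup >/dev/null 2>&1 || pup() { echo "pup not installed; skipped: pup $*" >&2; }

# A hypothetical article with one headline link and one link inside the summary.
cat > nesting-demo.html <<'EOF'
<article>
  <h1 class="headline"><a href="/story.html">Headline link</a></h1>
  <p class="description">Summary with an <a href="/aside.html">inline link</a>.</p>
</article>
EOF

# Space (descendant): any a nested anywhere inside an article,
# so both links qualify.
pup 'article a attr{href}' < nesting-demo.html

# > (direct child): only an a whose parent is an h1 whose parent
# is an article, so the link inside the p does not qualify.
pup 'article > h1 > a attr{href}' < nesting-demo.html
```

For most scraping, the looser space form combined with a class (such as 'h1.headline a') is specific enough and survives small changes to the page's nesting.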

Selecting the attribute value of an element

In the img tag, the src attribute points to where the image file is physically located:

<img src="/files/pages/nyt-displays.jpg">

To get the src attribute of this img tag:

cat nyt-sample.html | pup 'img attr{src}'
/files/pages/nyt-displays.jpg

The attribute that you'll deal with the most in web scraping is the href attribute, which is part of standard a-tagged elements (i.e. anchor links, or "hyperlinks").

To get all the values of the href attributes for all the a tags on the page:

cat nyt-sample.html | pup 'a attr{href}'
http://www.nytimes.com
https://www.flickr.com/photos/zokuga/5804588208/in/photostream/
http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html
http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html
http://www.nytimes.com/2015/01/09/us/in-san-franciscos-tenderloin-a-move-to-help-artists-as-wealth-moves-in.html
http://nytimes.com/2015/01/09/opinion/the-stumbling-tumbling-euro.html
http://www.nytimes.com/2015/01/09/business/democrats-step-up-efforts-to-block-obama-on-trade-promotion-authority.html

To get all the values of the href attributes for just the a-tagged elements that are children of the h1-tagged elements (with a class of headline):

cat nyt-sample.html | pup 'h1.headline a attr{href}'
http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html
http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html
http://www.nytimes.com/2015/01/09/us/in-san-franciscos-tenderloin-a-move-to-help-artists-as-wealth-moves-in.html
http://nytimes.com/2015/01/09/opinion/the-stumbling-tumbling-euro.html
http://www.nytimes.com/2015/01/09/business/democrats-step-up-efforts-to-block-obama-on-trade-promotion-authority.html
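
Because pup prints one attribute value per line, its output drops straight into other line-oriented tools. A sketch, using the five headline URLs shown above saved to a hypothetical file, headline-urls.txt (on the real page you would create it by redirecting the pup command's output). It assumes the section name is always the seventh /-separated field, which holds for these five URLs but is not guaranteed elsewhere.

```shell
# Stand-in for:
#   cat nyt-sample.html | pup 'h1.headline a attr{href}' > headline-urls.txt
cat > headline-urls.txt <<'EOF'
http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html
http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html
http://www.nytimes.com/2015/01/09/us/in-san-franciscos-tenderloin-a-move-to-help-artists-as-wealth-moves-in.html
http://nytimes.com/2015/01/09/opinion/the-stumbling-tumbling-euro.html
http://www.nytimes.com/2015/01/09/business/democrats-step-up-efforts-to-block-obama-on-trade-promotion-authority.html
EOF

# Count the links (tr strips the padding some versions of wc add).
url_count=$(wc -l < headline-urls.txt | tr -d ' ')
echo "headline links: $url_count"

# Tally links per section: splitting on "/", the section is field 7
# (fields 1-3 are "http:", an empty string, and the host name).
cut -d/ -f7 headline-urls.txt | sort | uniq -c
```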

Selecting text elements

Think of the text elements as the literal text that you see on a page when rendered by the browser.

For example, given this HTML snippet:

<h1 id="main-title">Stories from the New York Times</h1>

The text of the h1 element is "Stories from the New York Times".

To select only the text of that h1 element with pup:

cat nyt-sample.html | pup 'h1#main-title text{}'
Stories from the New York Times
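
The text{} and attr{} functions each pull out one thing at a time. When you want a link's text and its href together, pup's README also documents a json{} display function that prints the selected elements as JSON. A short sketch against a made-up file, text-demo.html, assuming pup is installed; the exact JSON layout may differ between pup versions.

```shell
# If pup is missing, define a stand-in so the script still runs to the end.
command -v pup >/dev/null 2>&1 || pup() { echo "pup not installed; skipped: pup $*" >&2; }

# A hypothetical one-headline page.
cat > text-demo.html <<'EOF'
<h1 class="headline"><a href="/story.html">Headline link</a></h1>
EOF

# Just the text of the link.
pup 'h1.headline a text{}' < text-demo.html

# Just the href of the link.
pup 'h1.headline a attr{href}' < text-demo.html

# The whole a element as JSON, with its attributes and text together.
pup 'h1.headline a json{}' < text-demo.html
```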