While grep and regular expressions are a powerful way to search raw text, when text files already have structure – such as comma-delimited files, or raw HTML – we want to take advantage of programs specifically designed to exploit that structure. With HTML, especially, finding a pattern regular enough (never mind simple) for a regex to exploit is madness.
This is why we're using pup, which works from the command line. Most other HTML parsing libraries you'll use (such as BeautifulSoup for Python) select elements in pretty much the same way as pup.
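To make the point about regexes concrete, here is a minimal Python sketch (not part of the pup workflow) that compares a naive regex with the standard library's html.parser. The two HTML strings, the pattern, and the names HrefCollector and parse_hrefs are made up for illustration; the only assumption is that the same link can be written with its attributes in a different order.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same link: only attribute order and whitespace differ.
compact = '<a href="/story.html" class="story-link">Headline</a>'
reflowed = '<a class="story-link"\n   href="/story.html">Headline</a>'

# A naive pattern that assumes href is always the first attribute.
naive_pattern = re.compile(r'<a href="([^"]*)"')


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, in whatever order they appear.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def parse_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


regex_hits = [naive_pattern.findall(doc) for doc in (compact, reflowed)]
parser_hits = [parse_hrefs(doc) for doc in (compact, reflowed)]

print(regex_hits)   # the regex misses the reflowed link
print(parser_hits)  # the parser finds the link in both versions
```

The regex could be patched to handle this one case, but every new variation in the markup needs another patch. A parser, like pup, works from the structure rather than from the exact characters.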
I recommend just trying out pup, as described in the rest of this guide. The arguments it takes are CSS selectors, which you may be familiar with if you've ever used jQuery.
However, you don't have to know CSS (i.e. how to style webpages) to do HTML parsing. You just have to understand how CSS Selectors are used to target specific HTML elements. Instead of styling these HTML elements, we will be grabbing the text inside them. Different purpose, but same process and syntax of selection.
The pup program is a command-line tool that, when given HTML (i.e. piped in from a file or through curl), can filter it using CSS selectors. As its homepage says, "pup aims to be a fast and flexible way of exploring HTML from the terminal."
For corn.stanford.edu, I've written a separate tutorial on installing pup and other command-line tools that aren't part of corn.stanford.edu's standard offering. Check out that tutorial and then come back here to play with pup.
Let's try it out on the New York Times homepage:
# this fetches the nytimes homepage
curl -L http://www.nytimes.com
# Try it again, but pipe it into pup and select for headlines
curl -s -L http://www.nytimes.com | pup 'h2.story-heading a text{}'
# Let's get the URLs for those headlines
# We want to extract the 'href' attribute:
curl -s -L http://www.nytimes.com | pup 'h2.story-heading a attr{href}'
OK, that was easy (and error-free, I hope). Let's get a more detailed look.
(under construction)
For the vast majority of HTML parsing that you'll ever need to do, you'll need to know these things:
For the rest of the examples, check out this sample webpage I've created:
http://www.compciv.org/files/pages/nyt-sample/
To make the examples a little more readable, you can download the page and cache it as a local file (so that you don't have to keep re-downloading it):
curl -s http://www.compciv.org/files/pages/nyt-sample/ -o nyt-sample.html
Given this HTML snippet below, which represents the image that's on that sample page:
<img src="/files/pages/nyt-displays.jpg">
– the tag of this element is img.
To select the img tag via pup:
cat nyt-sample.html | pup 'img'
This returns:
<img src="/files/pages/nyt-displays.jpg">
In HTML, there are relatively few kinds of tags. To differentiate between elements with the same tag, elements are given different ids or classes.
For example, try selecting all the h1 tags:
cat nyt-sample.html | pup 'h1'
You'll see output that includes this:
<h1 id="main-title">
Stories from the New York Times
</h1>
<h1 class="headline">
<a href="http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html">
Honda Hit With Record Fine for Not Reporting Deaths
</a>
</h1>
...
To select the first h1 element (the one with the text Stories from the New York Times), we can select it exclusively by targeting its id attribute:
cat nyt-sample.html | pup 'h1#main-title'
In this case, since it happens to be the only element on the page with an id of main-title, this selector would work just as well:
cat nyt-sample.html | pup '#main-title'
The other h1-tagged elements all have a class of headline. To get them, use the dot, which selects by class:
cat nyt-sample.html | pup 'h1.headline'
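If it helps to see what a selector like h1#main-title or h1.headline is actually checking, here is a rough Python sketch using only the standard library. It is not how pup is implemented. The names parse_selector, SimpleMatcher, and select are hypothetical, SAMPLE is a simplified stand-in for the sample page (the links inside the headlines are left out), and the sketch only understands a tag, optionally followed by one #id or one .class.

```python
from html.parser import HTMLParser

# Simplified stand-in for the three kinds of h1 on the sample page.
SAMPLE = """
<h1 id="main-title">Stories from the New York Times</h1>
<h1 class="headline">Honda Hit With Record Fine for Not Reporting Deaths</h1>
<h1 class="headline">Youngest U.S. Grandmaster, 14, Weighs His Next Move</h1>
"""


def parse_selector(selector):
    """Split 'h1', '#main-title', 'h1#main-title' or 'h1.headline'
    into (tag, id, class); parts that are absent come back as None."""
    tag, elem_id, cls = selector, None, None
    if "#" in selector:
        tag, elem_id = selector.split("#", 1)
    elif "." in selector:
        tag, cls = selector.split(".", 1)
    return (tag or None, elem_id, cls)


class SimpleMatcher(HTMLParser):
    """Record the attributes of every start tag that satisfies the selector."""

    def __init__(self, selector):
        super().__init__()
        self.want_tag, self.want_id, self.want_cls = parse_selector(selector)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.want_tag and tag != self.want_tag:
            return
        if self.want_id and attrs.get("id") != self.want_id:
            return
        # An element may carry several space-separated classes.
        if self.want_cls and self.want_cls not in (attrs.get("class") or "").split():
            return
        self.matches.append(attrs)


def select(html, selector):
    matcher = SimpleMatcher(selector)
    matcher.feed(html)
    matcher.close()
    return matcher.matches


print(len(select(SAMPLE, "h1")))
print(select(SAMPLE, "#main-title"))
print(select(SAMPLE, "h1.headline"))
```

The three calls mirror the pup commands above: the bare tag matches every h1, the id narrows it to one element, and the class picks out the headlines.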
Given this HTML snippet:
<article>
<h1 class="headline">
<a href="http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html">Youngest U.S. Grandmaster, 14, Weighs His Next Move</a>
</h1>
<p class="description">
After becoming a grandmaster at the tender age of 13, Sam Sevian is getting some help from the chess champion Garry Kasparov.
</p>
</article>
The p element can be thought of as the child of the article element. To target that p element:
cat nyt-sample.html | pup 'article p'
You can also see that the a element is a child of the h1 element – which itself is a child of that article. Here's the most specific way to target that a element:
cat nyt-sample.html | pup 'article h1 a'
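In CSS, a space between selectors means "somewhere inside", at any depth, which is why 'article h1 a' reaches the link even though it sits two levels below the article. The Python sketch below illustrates that rule with a stack of open tags. It is an approximation for illustration, not pup's implementation. The names matches, DescendantText, and select_text are hypothetical, the final paragraph in SNIPPET is added only to show a non-match, and the sketch handles plain tag names only and assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

SNIPPET = """
<article>
  <h1 class="headline">
    <a href="http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html">Youngest U.S. Grandmaster, 14, Weighs His Next Move</a>
  </h1>
  <p class="description">After becoming a grandmaster at the tender age of 13, Sam Sevian is getting some help from the chess champion Garry Kasparov.</p>
</article>
<p>A paragraph outside any article.</p>
"""


def matches(path, parts):
    """True if the last open tag equals the last selector part and the
    earlier parts appear, in order, somewhere among its ancestors."""
    if not path or path[-1] != parts[-1]:
        return False
    ancestors = iter(path[:-1])
    # 'part in ancestors' advances the iterator, so order is respected.
    return all(part in ancestors for part in parts[:-1])


class DescendantText(HTMLParser):
    """Collect the text of each element matched by a space-separated selector."""

    def __init__(self, selector):
        super().__init__()
        self.parts = selector.split()
        self.stack = []             # tags that are currently open
        self.capture_depth = None   # stack depth of the element being captured
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self.capture_depth is None and matches(self.stack, self.parts):
            self.capture_depth = len(self.stack)
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the tag being closed.
            while self.stack.pop() != tag:
                pass
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.results[-1] += data


def select_text(html, selector):
    parser = DescendantText(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


print(select_text(SNIPPET, "article p"))
print(select_text(SNIPPET, "article h1 a"))
print(select_text(SNIPPET, "p"))
```

Here 'article p' skips the paragraph that sits outside the article, while a bare 'p' picks up both paragraphs.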
In the img tag, the src attribute points to where the image file is physically located:
<img src="/files/pages/nyt-displays.jpg">
To get the src attribute of this img tag:
cat nyt-sample.html | pup 'img attr{src}'
/files/pages/nyt-displays.jpg
The attribute that you'll deal with the most in web-scraping is the href attribute, which is part of standard a-tagged elements (i.e. anchor links, or "hyperlinks").
To get the values of the href attributes for all the a tags on the page:
cat nyt-sample.html | pup 'a attr{href}'
http://www.nytimes.com
https://www.flickr.com/photos/zokuga/5804588208/in/photostream/
http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html
http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html
http://www.nytimes.com/2015/01/09/us/in-san-franciscos-tenderloin-a-move-to-help-artists-as-wealth-moves-in.html
http://nytimes.com/2015/01/09/opinion/the-stumbling-tumbling-euro.html
http://www.nytimes.com/2015/01/09/business/democrats-step-up-efforts-to-block-obama-on-trade-promotion-authority.html
To get all the values of the href attributes for just the a-tagged elements that are children of the h1-tagged elements (with a class of headline):
cat nyt-sample.html | pup 'h1.headline a attr{href}'
http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html
http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html
http://www.nytimes.com/2015/01/09/us/in-san-franciscos-tenderloin-a-move-to-help-artists-as-wealth-moves-in.html
http://nytimes.com/2015/01/09/opinion/the-stumbling-tumbling-euro.html
http://www.nytimes.com/2015/01/09/business/democrats-step-up-efforts-to-block-obama-on-trade-promotion-authority.html
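For comparison, here is roughly what the difference between 'a attr{href}' and 'h1.headline a attr{href}' looks like when written by hand in Python. Treat it as a sketch under assumptions rather than an equivalent of pup: HeadlineLinks is a hypothetical name, SAMPLE is a cut-down stand-in for the sample page, and the link text "The New York Times" is a guess for illustration.

```python
from html.parser import HTMLParser

# Cut-down stand-in: one link outside any headline, two links inside headlines.
SAMPLE = """
<a href="http://www.nytimes.com">The New York Times</a>
<h1 class="headline">
  <a href="http://www.nytimes.com/2015/01/09/business/honda-fined-70-million-in-underreporting-safety-issues-to-government.html">Honda Hit With Record Fine for Not Reporting Deaths</a>
</h1>
<h1 class="headline">
  <a href="http://www.nytimes.com/2015/01/09/sports/program-prepares-the-chess-prodigy-sam-sevian-for-his-next-moves.html">Youngest U.S. Grandmaster, 14, Weighs His Next Move</a>
</h1>
"""


class HeadlineLinks(HTMLParser):
    """Collect every href, and separately the hrefs found inside h1.headline."""

    def __init__(self):
        super().__init__()
        self.all_hrefs = []
        self.headline_hrefs = []
        self.in_headline = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "headline" in (attrs.get("class") or "").split():
            self.in_headline = True
        elif tag == "a" and "href" in attrs:
            self.all_hrefs.append(attrs["href"])
            if self.in_headline:
                self.headline_hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_headline = False


links = HeadlineLinks()
links.feed(SAMPLE)
links.close()

print(links.all_hrefs)       # like: pup 'a attr{href}'
print(links.headline_hrefs)  # like: pup 'h1.headline a attr{href}'
```

The one-line pup selector replaces all of this bookkeeping, which is much of the appeal of a selector-based tool.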
Think of the text elements as the literal text that you see on a page when rendered by the browser.
For example, given this HTML snippet:
<h1 id="main-title">Stories from the New York Times</h1>
The text of the h1 element is "Stories from the New York Times".
Using pup to select only the text of that h1 element:
cat nyt-sample.html | pup 'h1#main-title text{}'
Stories from the New York Times