Week 2

Topics

I'll keep using the word "parser" without fully explaining it. Today we'll look at HTML parsing, but that's just the first step toward understanding how many other data structures are parsed.

A nice way to examine parsing in general is to look at how human language is parsed. Check out the Stanford Parser and play around with it.

Homework

Lecture notes

Here are several relevant guides that I've created for this week:

  1. A review of the pipes and filters concept
  2. How to grep
  3. Basic regular expressions
  4. How to write a basic shell script
  5. How to run a process in the background
  6. How to set up your GitHub compciv repo (note: we'll go over this in detail on Wednesday, so don't fret about doing it on your own)
  7. How to install extra programs on your corn.stanford.edu space - You'll need to do this to get pup, the HTML parser, onto corn.stanford.edu so that you can (sanely) complete future web-scraping projects. If you want, you can also try installing some of the other programs listed, including the command-line movie-to-GIF maker.
  8. Basic HTML parsing with pup - Parsing HTML (or any other data structure) introduces a different paradigm from what we're used to. We are still working with text, but we'll be using programs specifically designed to parse raw text and make structure out of it. In other words, line-by-line grepping is not really sufficient for complex data structures.
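
To make the point in item 8 concrete, here is a minimal sketch of where grep falls short on HTML. The file name sample.html and its contents are made up for illustration, and the pup step only runs if pup is already installed (see item 7); treat this as a rough demonstration, not as the guide's own example.

```shell
# Hypothetical sample page: the second link's tag is split across two lines,
# which is perfectly valid HTML.
cat > sample.html <<'EOF'
<html><body>
<a href="/one">One</a>
<a class="x"
   href="/two">Two</a>
</body></html>
EOF

# Line-oriented approach: grep can only match when "<a" and "href" share a line.
grep_links=$(grep -oE '<a [^>]*href="[^"]*"' sample.html | grep -oE 'href="[^"]*"')
echo "$grep_links"   # prints href="/one" only; the /two link is missed

# Parser approach: pup reads the document as a tree of elements, so a line
# break inside a tag makes no difference. Skipped if pup is not installed.
if command -v pup >/dev/null 2>&1; then
  pup 'a attr{href}' < sample.html
fi
```

The regular expression could be patched to cope with this one case, but each new quirk of real-world HTML would need another patch, which is why a dedicated parser is the saner tool.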

[Image: made via the gifify tool]