Death Row rows parsing

Collect and aggregate data from three different states’ death row listings.

Due: Friday, January 30
Points: 10

This is another exercise in HTML and plaintext (well, and PDF, too) parsing. It will involve compiling death row data from the criminal justice institutions of three different states: Texas, Florida, and California.

This is an extension of the HTML web scraping exercise here (the first part of it, not the rambling parts after that).

The HTML and text parsing, you already know how to do. PDF parsing, at least in this case, is not much different. And now you can say you’ve had some experience in the fun field of working with and coordinating datasets from different jurisdictions.

This exercise will extract the Race/Ethnicity data from each data source and in the final step, will print to stdout the combined data from each data source. In other words, the script will run the three other scripts, process and standardize their taxonomy, and output a text stream suitable to pass into another filter (to do a count by state/race, for example):
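For example, the combined stream follows a state,race format like this (the specific lines below are illustrative, not actual data):

```
TX,Black
TX,White
FL,Hispanic
CA,Other
```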


So if the concepts of separating concerns (between different scripts), stdin/stdout, and shell scripting aren't clear to you now, they will be after this exercise. Also, you'll get to examine the data produced by parts of our criminal justice system. This exercise involves extracting the race/ethnicity column because it's the least difficult to parse across all the different data sources, even though the taxonomy differs.


  • A project folder named "death-row-parsing"

    In your compciv repo, create a folder named:

        death-row-parsing

    This is where your code will exist.

  • The setup script

    This script should download all of the raw data files needed by the other scripts, so that none of them has to do its own downloading.

  • The Texas script

    This script should parse the two Texas death row pages and, when run, output the Race column as collected from both pages, as a text stream, e.g.


    The exact code to do it for the current death row listing can be found in the Scraping the Texas death row list tutorial. The code to extract the data from the 1923-1973 page should be pretty similar.

  • The Florida script

    This script should parse the three Florida death penalty pages and output the “Race/Gender” column, as collected from all three pages, as a text stream, e.g.

  • The California script

    This script should parse the California condemned inmate list and output the “Ethnic Code” column as a text stream, e.g.

  • The combining script

    This script should run the other three scripts and produce a single text stream:


    The order of the data doesn’t matter, but what does matter is that the terms used for race are standardized in the output. That is, even though each state uses different terms, the combining script should produce a standardized output using these terms:

    • Black
    • Hispanic
    • Other
    • White
  • Hints

    The parsing here is no more difficult than what you've seen in past exercises, including the initial pup tutorial and the White House briefings homework. However, there are enough distinct moving parts that you should probably not wait until the last day to start.

    The most unfamiliar part of this assignment will be figuring out what the combining script is supposed to do. Before you worry about it, ask yourself this: Have I finished the three other state scrapers so that they output exactly what was asked for? If not, then review this clip from The Wire (below), read the Wikipedia article on Separation of concerns, and finish those three scrapers before worrying about the combining script.

    Remember stdout

    Again, if you don't understand the concept of stdout, i.e. having programs print to standard output so that other programs (in this case, the combining script) can use their output, then re-read this brief primer on the concept.

    Each of the three state-focused scripts focuses on extracting and outputting the raw data from its respective source. They don't have to worry about anything other than doing their data-extraction job and printing/echoing to stdout. They let the combining script take care of making the terms uniform (i.e. converting WM to White), and so forth.

    This means that the combining script will itself be calling each of the three scripts, reading their output, and translating/converting as necessary to produce the desired stream:
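As a minimal sketch of the idea, here's a hypothetical stand-in "scraper" whose output the combining step reads through a pipe and standardizes (the filename and sed rules are assumptions for illustration only):

```shell
# a stand-in scraper script (hypothetical name) that just prints a raw code
echo 'echo WHI' > /tmp/ca_demo.sh

# the combining step would read its output through a pipe and standardize it
bash /tmp/ca_demo.sh | sed 's/WHI/White/; s/^/CA,/'
# CA,White
```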


    If you run the output of the combining script through the filters we've used before, you should get answers similar to this:

    # Showing the broken down totals for each state
    user@host:~$ bash
        272 CA,Black
        178 CA,Hispanic
         40 CA,Other
        256 CA,White
        312 FL,Black
         16 FL,Hispanic
          3 FL,Other
        348 FL,White
        399 TX,Black
        122 TX,Hispanic
          5 TX,Other
        253 TX,White
    # Showing the totals across three states
    user@host:~$ bash | cut -d ',' -f 2 | sort | uniq -c
        983 Black
        316 Hispanic
         48 Other
        857 White
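Here's the same counting idea run against a tiny invented stream, so you can verify the pipeline behavior yourself (the data values are made up):

```shell
# tally the race counts on a tiny invented sample of the combined stream
printf 'TX,White\nTX,Black\nCA,White\n' | cut -d ',' -f 2 | sort | uniq -c
# uniq -c reports Black once and White twice
```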

    The damn &nbsp;

    In some of your scrapes, you may notice a character in the raw HTML that is denoted as &nbsp;, which is invisible when viewed in a browser or in the command-line Terminal. This is a non-breaking space. There's not much reason to go into it, except that it can often be a severe pain in the ass. My advice: if it keeps coming up, filter for text values that contain at least one alphabetical character, i.e.:

     cat whatever.html | pup '.whatever' | grep '[[:alpha:]]'
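To see why this works, here's a tiny made-up sample where the middle line contains only the non-breaking space (written as its byte sequence, \xc2\xa0):

```shell
# the middle line holds only a non-breaking space; it has no
# alphabetical characters, so grep quietly drops it
printf 'White\n\xc2\xa0\nBlack\n' | grep '[[:alpha:]]'
# White
# Black
```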

    However, you can also match the non-breaking space by its literal byte sequence, \xc2\xa0, and filter it out using grep's -v inverse flag (the -P flag enables the Perl flavor of regex, which is necessary for the \x hex escapes):

      cat whatever.html | pup '.whatever' | grep -vP '\xc2\xa0'


    You should probably finish the Texas script first, as the parsing code will be very similar to the code found in the tutorial I've written. You'll just be running it twice (once for each of the Texas pages). The Texas website also has racial statistics posted, so you can use them to double-check your work.


    There are three pages here, but the concept is the same as it is for Texas, with the data being inside an HTML table with a "Race/Gender" column. You may be wondering what to do with the gender information after you've extracted it: again, that's for the combining script to worry about.


    This is kind of a trick question: this list isn't HTML, it's a PDF. You may be wondering, "well, how do I turn a PDF into text?", but I'll leave that to you and Google.

    Suffice it to say, the recommended tool is already installed on the class machines, and if you're using the one that I think is the most obvious to use, look at its help page; one of its options will allow you to convert the PDF into text in a format that looks like this:


    And at that point, it's just using good ol' grep and a regular expression to extract the terms relating to race.

    If you find the pattern difficult to isolate in a single grep, I would highly suggest using two grep calls, based on the observation that, according to the data in that PDF, there is one common physical aspect about every inmate whose race is listed.


    If you've made it this far, congratulations. This means that the other three scripts, when run, are outputting their extracted race data to standard output.

    This means that the combining script will run each script, and then perform the necessary translation, for example, in the case of California, converting OTH to Other, and so forth.

    The four racial categories that are common to all three states are: Black, Hispanic, Other, and White.

    The combining script will produce a stream consisting of those terms and the state from which each data row was produced, e.g.:


    Since the combining script is running each of the scripts, and capturing what each script outputs, it can also output the state identifier (and the comma separator) with each line.

    Using sed

    We haven't formally covered sed, but I think it will be by far the easiest way to convert the output. sed is a big program, but you should only need its immensely useful substitution command.

    Think of sed as a much more flexible and powerful version of the tr tool, as tr can only translate individual characters, not strings or patterns.
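To see the difference on a concrete (made-up) value:

```shell
# tr maps single characters one-to-one...
echo 'WM' | tr 'WM' 'wm'
# wm

# ...while sed can substitute an entire string at once
echo 'WM' | sed 's/WM/White/'
# White
```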

    Here's sample usage:

    user@host:~$ echo 'WM' | sed 's/W/White/'
    WhiteM

    Don't be afraid to use as many sed filters in succession as you need when transforming a stream:

    user@host:~$ echo 'WM' | sed 's/M//' | sed 's/W/White/'
    White

    You can also take advantage of regular expressions with sed, though I think the text you need to translate is simple enough that you can stick with just translating literal characters.
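For instance, a single anchored regex substitution can do in one step what two literal substitutions would otherwise do (this is just an illustration, not required for the assignment):

```shell
# a W at the start of the line, plus whatever follows it, becomes "White"
echo 'WM' | sed -E 's/^W.*/White/'
# White
```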

    Find more examples of sed and substitution here.

    Again, it can't be emphasized enough that you'll want to do things step-by-step, and to understand that what makes the logistics here possible (or at least relatively simple) is the idea of text as a universal interface.


    You can see a copy of the solution on GitHub.

    The setup script is in charge of preparing all of the data, including downloading it, such that none of the other scripts should involve a curl call or anything other than text/HTML-processing tools:

    mkdir -p data-hold
    # Download Texas
    curl -s > data-hold/tx_dr.html
    curl -s > data-hold/tx_dr_past.html
    # Download Florida
    curl -s > data-hold/fl_dr_roster.html
    curl -s > data-hold/fl_dr_list2.html
    curl -s > data-hold/fl_dr_list.html
    # Download California data
    curl -s > data-hold/ca_dr.pdf
    # Convert the California PDF into plain text
    pdftotext -layout data-hold/ca_dr.pdf data-hold/ca_dr.txt

    The Texas script is pretty straightforward and follows the same strategy outlined in the HTML-parsing tutorial here. The two different pages simply list race in different columns:

    cat data-hold/tx_dr.html | pup 'td:nth-of-type(7) text{}'
    cat data-hold/tx_dr_past.html | pup 'td:nth-of-type(4) text{}' 

    The Florida script involves the same kind of pup call across all three pages (thus, I'm able to cat them all together), though because there are extra rows (corresponding to multiple crimes per inmate), we have to use grep to remove lines that don't have upper-case characters in them. The following is actually just a one-line command involving cat, pup, and grep, though I've split it up with backslashes:

    cat data-hold/fl_dr_list2.html     \
        data-hold/fl_dr_list.html      \
        data-hold/fl_dr_roster.html |  \
    # grab the text of the third column
      pup 'td:nth-of-type(3) text{}' | \
    # match only the lines that have uppercase letters in them,
    # as some columns contain &nbsp;, i.e. the invisible non-breaking spaces
      grep -Eo '\b[[:upper:]]{2}\b'

    The California script does not use pup, because after the pdftotext conversion, the data is just plain text, not HTML.

    You could extract the text successfully by just using grep a couple of times. Remember the purpose of regular expressions: to describe patterns in text. Please continue to review them, as they will make your life much, much easier in just about any text-handling task you'll encounter in the real world (which will be many occasions):

    cat data-hold/ca_dr.txt | \
    # Cut off everything before "Living"
      grep -oE 'Living.+' | \
    # Grab only word-bounded runs of three capital letters
      grep -oE '\b[A-Z]{3}\b'
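Here's the same two-grep idea run against a single invented line shaped roughly like the pdftotext output (the name and numbers below are made up):

```shell
# an invented line shaped roughly like the pdftotext output
printf 'SMITH, JOHN  Living WHI 6-01 1985\n' | \
  grep -oE 'Living.+' | \
  grep -oE '\b[A-Z]{3}\b'
# WHI
```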

    As you should've noticed by now, the other three scripts just care about printing text to standard output. This leaves the combining script to call each of them and modify the text as needed:

    # Printing Texas
    # There is one substitution necessary, as there was one inmate
    # classified as "Native    American"
    bash | sed -E 's/Native +American/Other/' | \
    # the caret (when not inside brackets) denotes the beginning of the line,
    # so I'm effectively saying: substitute "TX," at the beginning of each line
          sed s/^/TX,/
    # Printing Florida
    # since we're discarding gender, we just need to substitute based
    # on the first character
    bash | sed s/B./Black/ | sed s/W./White/ | \
       sed s/H./Hispanic/ | sed s/O./Other/ | \
       sed s/^/FL,/
    # Printing California
    bash | sed s/WHI/White/ | sed s/BLA/Black/ | \
       sed s/HIS/Hispanic/ | sed s/OTH/Other/ | \
       sed s/^/CA,/
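To convince yourself the Florida-style translation behaves as intended, here it is applied to a tiny invented stream of raw codes:

```shell
# apply the Florida-style substitutions to invented raw race/gender codes
printf 'WM\nBF\nHM\n' | \
  sed 's/B./Black/' | sed 's/W./White/' | \
  sed 's/H./Hispanic/' | sed 's/O./Other/' | \
  sed 's/^/FL,/'
# FL,White
# FL,Black
# FL,Hispanic
```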

    Common mistakes

    Not everything is HTML

    Remember that each parsing tool we've learned, including jq and pup, is meant for a specific text format, such as JSON or HTML. When you are dealing with text that does not follow that format, you do not use those tools. They are not general-purpose text tools in the way that grep is; they are format-specific parsers.

    Not separating concerns

    The purpose of the setup script is to download everything once, so that the other scripts can happily focus on the parsing/printing parts of their jobs. So if your setup script has curled the appropriate files, then the individual parsing scripts should not repeat the downloading. Similarly, the California script should not have any calls to pdftotext; it expects to deal only with plaintext by the time it goes into action.

    Repeating yourself

    Here's an example of needlessly repetitive code that prepends the state identifier one race at a time:
    bash | sed s/Black/TX,Black/     | \
        sed s/White/TX,White/   |  \
        sed s/Hispanic/TX,Hispanic/   |  \
        sed s/Other/TX,Other/ 

    Remember that being able to express yourself in patterns is one of the superpowers of programming. You just scraped two pages containing 750+ data points with just two lines of code because you were able to exploit a pattern…you should be indignant at the thought of having to copy-paste that sed command four times just to insert the damn TX, in front of each line. This assignment was only a toy example; imagine if the assignment was to get inmate listings for all 50 states (or at least the ones that still have capital punishment) - the probability that you'd be able to type out all those sed commands without making a mistake is very slim.

    Learn to express your actions in the most minimal way possible. It's not that your goal is to replace White with TX,White, it's that you need to insert TX, at the beginning of each line. And there's a regular expression pattern for that:

    bash | sed s/^/TX,/
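For example:

```shell
# one anchored substitution handles every line, no matter the race value
printf 'White\nBlack\n' | sed 's/^/TX,/'
# TX,White
# TX,Black
```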