Homework

Collecting and analyzing job listings from the USAJobs.gov API

Ask what you can do for your country, and what your country can pay you.

Due: Friday, February 6

Points: 10

Using the JSON API from USAJobs.gov, we’ll write a scraper that collects all the current job openings as raw data, stores them in a time-stamped directory, and does a quick analysis of the highs and lows for salaried positions.

This exercise is a repeat of the scraping-to-analysis exercises we’ve done before, including more parsing of JSON with jq, and you’ll have to write some runtime execution logic to deal with job categories that have multiple pages of job openings.

Because the USAJobs.gov API is free, and relatively forgiving, this is a good time to practice the concepts involved with interacting with a remote API. But try to avoid writing an infinite loop in your program.

The USAJobs.gov site search is actually pretty robust. However, being able to collect and parse the data as we please gives us a lot more speed and flexibility in performing queries to find interesting or specific jobs. If we wanted to study job posting trends over time, or perhaps create a job-search site of our own, then having a script that can be set to automatically run and collect data every day would be an extremely effective tool.

You can download a zipped snapshot of the job openings here

Deliverables

A project folder named `~/compciv/homework/usajobsgov`

The directory structure will look something like this:

  |-compciv
    |-homework
       |--usajobsgov
          |--scraper.sh
          |--analyzer.sh
          |--data-hold
             |--OccupationalSeries.xml
             |--scrapes
                |----2015-01-20_1500
                  |--0000-1.json
                  |--0000-2.json
                  |--0100-1.json

A script named `scraper.sh`

Creating a time-stamped directory

Upon launch, the scraper.sh script will create a timestamped directory in ./data-hold/scrapes, in this format:

  YYYY-MM-DD_HH00

For example, if scraper.sh is run on Jan 20th, 2015, at 3:45PM, it should create this directory:

  ./data-hold/scrapes/2015-01-20_1500

(This particular arrangement means that scrapes that run at 3:10PM and at 3:45PM on 1/20/2015 will save to the same directory, 2015-01-20_1500, which is fine for our purposes.)

Collecting the JobFamily values

Then, scraper.sh will parse the data-hold/OccupationalSeries.xml file to find all of the JobFamily values, e.g. 0000, 2200, 9900.

Collecting data from data.usajobs.gov/api/jobs

For each of the JobFamily values, make the appropriate curl to the JSON API from USAJobs.gov and retrieve all the current job openings for that “JobFamily”. Check out the USAJobs.gov documentation on API Query Parameters for more information on what you need to curl.

Paginate when necessary

In cases where there are more job postings than can fit in a single response, the scraper.sh script should loop and collect each page.

After each visit to the https://data.usajobs.gov/api/jobs endpoint, the scraper.sh should save a file in the timestamped directory, one for each page of each “JobFamily”.

So, if the JobFamily of 2200 had 4 pages of job openings, scraper.sh would save these files:

      |--scrapes
         |----2015-01-20_1500
                |--2200-1.json
                |--2200-2.json
                |--2200-3.json
                |--2200-4.json
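
The naming scheme above can be tried out as a dry run that makes no network calls. This is only a sketch: the page counts are made-up placeholders (in scraper.sh they would come from each API response), and the variable names are my own.

```shell
# Dry run: print the filename scraper.sh would write for each page.
# The "family:pages" pairs are hypothetical stand-ins for values
# that would really come from the API responses.
scrape_dir="./data-hold/scrapes/2015-01-20_1500"
planned_files=""

for pair in "0000:1" "2200:4"; do
  family="${pair%%:*}"   # text before the colon
  pages="${pair##*:}"    # text after the colon
  for page in $(seq 1 "$pages"); do
    fname="$scrape_dir/$family-$page.json"
    echo "Would save $fname"
    planned_files="$planned_files $fname"
  done
done
```

Printing the filenames before writing any curl calls is a cheap way to confirm that the loop produces one file per page per JobFamily.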

A script named `analyzer.sh`

When the analyzer.sh script is executed, it expects one argument to be passed in: a timestamped sub-directory within scrapes, e.g.:

      bash analyzer.sh 2015-01-20_1500
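
Checking that argument is not a requirement, but it can save some confusing errors. Here is a minimal sketch, assuming the directory layout from the Deliverables; check_scrape_arg is a hypothetical helper name of mine, not part of the assignment.

```shell
# Validate the single argument analyzer.sh expects.
# check_scrape_arg is a hypothetical helper, not part of the assignment.
check_scrape_arg() {
  # Exactly one argument must be given
  if [ "$#" -ne 1 ]; then
    echo "Usage: bash analyzer.sh YYYY-MM-DD_HH00" >&2
    return 1
  fi
  # The argument must name an existing directory under scrapes
  if [ ! -d "./data-hold/scrapes/$1" ]; then
    echo "No such scrape directory: ./data-hold/scrapes/$1" >&2
    return 1
  fi
  return 0
}
```

Inside analyzer.sh, one way to use it would be `check_scrape_arg "$@" || exit 1` near the top of the script.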

The analyzer.sh script will then:

Collect just the job postings that have a salary-basis as “Per Year”
Collect and count the unique job titles
Select the 25 most frequently occurring job titles
For each of these job titles, print pipe-delimited output that includes the job title, the minimum salary, and the maximum salary among the collected job records.

In other words, each job title will presumably have more than one job-listing. The analyzer.sh script prints a simple report showing the variance in possible salaries.

Here’s what the output looked like for jobs posted on January 26:

    Transportation Security Officer (TSO)|31203.00|52184.00
    Physician (Psychiatrist)|97987.00|250000.00
    Physician (Primary Care)|97987.00|215000.00
    Physical Therapist|39179.00|104306.00
    Contract Specialist|40336.00|172443.00
    Social Worker|49285.00|126949.00
    Program Analyst|47684.00|158700.00
    Medical Technologist|36379.00|107434.00
    Physician Assistant|57798.00|119443.00
    Physician (Hospitalist)|97987.00|240000.00
    Supply Technician|31315.00|61994.00
    Medical Support Assistant|25434.00|49166.00
    Clinical Psychologist|57408.00|118515.00
    Advanced Medical Support Assistant|35256.00|54806.00
    Physician|97987.00|325000.00
    Auditor|34576.00|116901.00
    Civil Engineer|36379.00|143152.00
    Physician (Gastroenterology)|97987.00|320000.00
    Interdisciplinary|31944.00|158700.00
    Dental Assistant|25434.00|50374.00
    Budget Analyst|39179.00|149333.00
    Public Affairs Specialist|50073.00|139523.00
    Psychiatrist|97987.00|260000.00
    Occupational Therapist|48403.00|91255.00
    Physician (Psychiatry)|96539.00|260000.00
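
Each row of that report boils down to cleaning up the salary strings and picking the smallest and largest. Here is a small worked sketch with made-up salary values for a single job title, written in the "$12,345.00" format the API returns.

```shell
# Made-up SalaryMin/SalaryMax values for one job title, in the
# "$12,345.00" format the API returns.
mins='$47,684.00
$76,378.00
$52,146.00'
maxes='$99,296.00
$158,700.00
$61,994.00'

# Strip "$" and "," so that sort -n can compare the values as numbers.
min=$(echo "$mins" | tr -d '$,' | sort -n | head -n 1)
max=$(echo "$maxes" | tr -d '$,' | sort -rn | head -n 1)

# Print one pipe-delimited row: title|min|max
echo "Program Analyst|$min|$max"
```

Without the tr step, sort would compare the values as text, and the dollar signs and commas would throw off the ordering.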

Hints

I'll be brief in directions here, as this exercise follows all the patterns and strategies you've practiced before.

Here's a sample endpoint and response for series 2210, which belongs to the JobFamily of 2200:

https://data.usajobs.gov/api/jobs?series=2210

{
  "TotalJobs": "183",
  "JobData": [
    {
      "DocumentID": "391383700",
      "JobTitle": "Information Technology Specialist",
      "OrganizationName": "Department Of Health And Human Services",
      "AgencySubElement": "Centers for Medicare & Medicaid Services",
      "SalaryMin": "$76,378.00",
      "SalaryMax": "$99,296.00",
      "SalaryBasis": "Per Year",
      "StartDate": "1/16/2015",
      "EndDate": "1/28/2015",
      "WhoMayApplyText": "United States Citizens",
      "PayPlan": "GS",
      "Series": "2210",
      "Grade": "12/12",
      "WorkSchedule": "Full Time",
      "WorkType": "Permanent",
      "Locations": "Woodlawn, Maryland",
      "AnnouncementNumber": "CMS-OTS-DE-15-1301322",
      "JobSummary": "CMS' effectiveness depends on the capabilities of a dedicated, professional staff that is committed to supporting these objectives. A career with CMS offers the opportunity to get involved on important national health care issues and be part of a dynamic, fast-paced, and highly visible organization. For more information on CMS, please visit: http://www.cms.gov/ . This position is located in the Department of Health & Human Services (HHS), Centers for Medicare & Medicaid Services (CMS), Office of Technology Solutions (OTS), Woodlawn, MD. WHO MAY APPLY: This is a competitive vacancy, open to all United States Citizens or Nationals, advertised under Delegated Examining Authority....",
      "ApplyOnlineURL": "https://www.usajobs.gov/GetJob/ViewDetails/391383700?PostingChannelID=RESTAPI"
    }
  ],
  "Pages": "8"
}
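
The two fields that matter for pagination are TotalJobs and Pages. As a rough sketch, here is how they could be pulled out with jq, using a trimmed-down stand-in for the response above (the JobData array is left empty to keep it short).

```shell
# A trimmed-down stand-in for one API response (values from the sample above).
response='{"TotalJobs": "183", "JobData": [], "Pages": "8"}'

# -r prints raw strings, without the surrounding JSON quotes.
total_pages=$(echo "$response" | jq -r '.Pages')
total_jobs=$(echo "$response" | jq -r '.TotalJobs')
echo "$total_jobs jobs across $total_pages pages"
```

Note that the API returns both numbers as quoted strings, which is why -r is handy here: the values come out ready to use in a loop.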

Setup

While you can get the OccupationalSeries.xml file from the USAJobs.gov page, the site prevents a direct curl. So I've made a copy, which you can download like this:

curl -o ./data-hold/OccupationalSeries.xml http://stash.compciv.org/usajobs.gov/OccupationalSeries.xml

Parsing OccupationalSeries.xml

So the first thing you need to do is get a list of job categories, which, in the parlance of USAJobs.gov, are referred to as JobFamily values.

The OccupationalSeries.xml contains a list of JobFamily values. You'll iterate through each of these to get all of the job openings on data.usajobs.gov.

While HTML and XML look alike, they are different formats, and you can't parse the OccupationalSeries.xml with pup, which expects HTML.

However, corn.stanford.edu has the hxselect program which can be used to parse XML in very much the same manner as pup. Check out the hxselect documentation to see how it is used.

Hint: The OccupationalSeries.xml file has a confusing layout. There are effectively two lists. Only one of those lists contains just the unique JobFamily values.

Creating a time-stamped directory

Check out the tools page to see how date can be used to create a date-formatted string, which you can use for the directory's name.
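
As a minimal sketch of that idea (the variable names are my own), the literal 00 after %H is what rounds every run down to the top of the hour:

```shell
# Build a directory name of the form YYYY-MM-DD_HH00.
# The literal "00" after %H replaces the minutes, so every run
# within the same hour maps to the same directory.
stamp=$(date +%Y-%m-%d_%H00)
scrape_dir="./data-hold/scrapes/$stamp"

# -p creates any missing parent directories and does not complain
# if the directory already exists.
mkdir -p "$scrape_dir"
echo "Saving to $scrape_dir"
```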

Read the documentation

Read the documentation, especially the part about API Query Parameters. Besides the Series parameter, the only other parameter to really care about is the Page parameter. You need to be able to paginate through all of the jobs.

Hint: There is one other parameter that is useful for reducing the number of pages you have to traverse; all of the other parameters basically narrow the search field, which you don't want.

Do a simple scrape first

Before worrying about getting all the pages, worry about correctly iterating through all the different JobFamily codes first, as if you only had to collect one page each.

And use an echo statement to see what's happening:

for jobfamily in $jobfamilies; do 
  page_count=1
  echo "Fetching jobs in $jobfamily, page $page_count"
  # etc etc
done

You should get output that looks like this:

Fetching jobs in 0000, page 1
Fetching jobs in 0100, page 1
Fetching jobs in 0200, page 1
Fetching jobs in 0400, page 1
Fetching jobs in 0500, page 1

Then, when you adjust your code to do multi-page downloading (it will likely require a for-loop within a for-loop), your echo output should look like this:

Fetching jobs in 0000, page 1
Fetching jobs in 0100, page 1
Fetching jobs in 0100, page 2
Fetching jobs in 0100, page 3
Fetching jobs in 0200, page 1
Fetching jobs in 0300, page 1
Fetching jobs in 0300, page 2
Fetching jobs in 0300, page 3
Fetching jobs in 0400, page 1
Fetching jobs in 0400, page 2
Fetching jobs in 0500, page 1

Basically, you want to avoid hammering the data.usajobs.gov site with the same call, over and over and over and over.

Hint: don't use the word jobs as a variable name. The word jobs is already the name of a Unix command.

Another big hint: When doing a curl of the URL, enclose it in double-quotes, e.g.: curl "http://data.usajobs.gov/etc&etc"…otherwise the shell treats each ampersand as the end of the command (and runs it in the background), so curl never sees the rest of the URL.
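
Here is a small sketch of the quoting, with printf standing in for curl so that nothing is actually fetched. The variable names are my own; Series and Page are the query parameters mentioned in the hints above.

```shell
# Build the request URL in a variable; Series and Page are the
# query parameters named in the hints above.
jobfamily="2200"
page=2
url="https://data.usajobs.gov/api/jobs?Series=${jobfamily}&Page=${page}"

# Quoted, the whole URL reaches the command as one argument.
# (printf stands in for curl here so that nothing is fetched.)
printf 'Would fetch: %s\n' "$url"

# Unquoted on a command line, the shell would treat & as
# "run in the background", so curl would only see the part before it:
#   curl https://data.usajobs.gov/api/jobs?Series=2200&Page=2
```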

Parsing with JQ

If you want to practice parsing with jq without waiting to finish implementing scraper.sh, you can download a zipped snapshot of the job openings here.

Getting analyzer.sh to produce the correct output will involve a combination of basic and maybe some fancy jq usage, and good old-fashioned Unix tools like grep and sort.

Before writing analyzer.sh, try parsing the collected data for individual fields to get an idea of what gets returned:

cat *.json | jq -r '.JobData[] | .SalaryMax' | sort | uniq -c | sort -rn | head -n 10

    201 $215,000.00
    161 $76,131.00
    153 $240,000.00
    150 $91,255.00
    120 $51,437.00
    102 $195,000.00
     98 $62,920.00
     97 $108,507.00
     94 $158,700.00
     89 $46,294.00

cat *.json | jq -r '.JobData[] | .SalaryBasis' | sort | uniq -c | sort -n

      1 Bi-weekly
      1 Per Day
      1 Per Month
      1 Student Stipend Paid
      2 School Year
      8 Fee Basis
     11 Without Compensation
   1300 Per Hour
   5600 Per Year

Using jq's select

Remember that the output should include only jobs that are "Per Year". This requires using the select function for jq. Rather than risk you going off the deep-end with using grep, here's an example you can use and modify:

yearly_jobs=$(cat *.json | jq '.JobData[] | select(.SalaryBasis == "Per Year")')

That code snippet will select all items in the .JobData array whose .SalaryBasis attribute has the value Per Year.

New hint: Transforming the job data

For the rest of analyzer.sh, you only care about three fields: JobTitle, SalaryMin, and SalaryMax.

So assuming yearly_jobs contains just the Per Year jobs, as in the above step, you can use jq to get you a list of objects containing just JobTitle, SalaryMin, and SalaryMax:

simple_rows=$(echo "$yearly_jobs" | jq '. | {JobTitle, SalaryMin, SalaryMax}')
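
To see the select and the trimming at work without a full scrape, here is a sketch that runs both on two stand-in records. The first record uses values from the sample response; the second is invented. The -c flag is an addition of mine: it makes jq print each object on a single line, which is easier to eyeball.

```shell
# Two stand-in records: the first uses values from the sample response,
# the second is invented to show a non-yearly job being filtered out.
sample='{"JobData": [
  {"JobTitle": "Information Technology Specialist", "SalaryMin": "$76,378.00",
   "SalaryMax": "$99,296.00", "SalaryBasis": "Per Year", "Series": "2210"},
  {"JobTitle": "Example Hourly Job", "SalaryMin": "$15.00",
   "SalaryMax": "$20.00", "SalaryBasis": "Per Hour", "Series": "0000"}
]}'

# select() keeps only the yearly jobs; {JobTitle, SalaryMin, SalaryMax}
# keeps only those three fields; -c prints each result on a single line.
rows=$(echo "$sample" | jq -c '.JobData[] | select(.SalaryBasis == "Per Year") | {JobTitle, SalaryMin, SalaryMax}')
echo "$rows"
```

Only the yearly record should survive, and it should carry just the three fields.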

Collecting unique job titles and using read-while

Regarding the requirement that the output include the 25 most-frequent job titles…

Here's another potential pitfall, given the format of the data and Bash's handling of whitespace. Using a for-loop here may not be optimal…so here's the code for the proper read-while loop, using standard input redirection and process substitution…two things I've skimmed over in class.

simple_rows=$(echo "$yearly_jobs" | jq '. | {JobTitle, SalaryMin, SalaryMax}')

while read -r line; do 
  # remember that each line contains something like:
  #   50   Some Job Title
  title=$(echo "$line" | grep -oE '[[:alpha:]].+')
  
  # filtered_rows holds just the rows of $simple_rows whose JobTitle matches $title
  filtered_rows=$(echo "$simple_rows" | jq "select(.JobTitle == \"$title\")" )

  min=$(echo "$filtered_rows" | jq -r '.SalaryMin' | tr -d '$' | tr -d ',' | sort -n | head -n 1)

  ## Get the max on your own
  echo "Finish the rest of your steps here to get the max, and print out the proper line as in the Deliverables"
  
  ## Echo the proper format as specified in the requirements

  # The done < ... is done for you
done < <(echo "$simple_rows" | jq -r '.JobTitle' | sort | uniq -c | sort -rn | head -n 25)

You can use this loop…but first run the command inside the <(...) on its own, so you know what line contains.
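
If you want to see the read-while pattern in isolation, here is a sketch that needs neither jq nor a scrape. The titles come from the sample report but the repetition counts are made up. As a variation on the loop above, it gives read two variable names, which splits the count from the title without grep.

```shell
# Stand-in list of job titles, one per line (titles taken from the
# sample report; the repetition counts are made up).
titles='Program Analyst
Social Worker
Program Analyst
Auditor
Program Analyst
Social Worker'

report=""
while read -r count title; do
  # read puts the first word (the count) into $count and
  # everything after it, spaces included, into $title.
  echo "$title appears $count times"
  report="$report$title=$count;"
done < <(echo "$titles" | sort | uniq -c | sort -rn | head -n 2)

# Because the loop reads from <(...) rather than from a pipe, it runs
# in the current shell and $report is still set here.
echo "$report"
```

Had the same command been piped into the while loop instead, Bash would run the loop in a subshell and report would be empty afterwards.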

Solution

scraper.sh

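# NOTE: $series must already hold the JobFamily values parsed from
# data-hold/OccupationalSeries.xml, e.g. 0000, 2200, 9900
td="./data-hold/scrapes/$(date +%Y-%m-%d_%H00)"
mkdir -p "$td"

for snum in $series; do 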
  # We always have to fetch the first page of the series
  echo "Fetching series $snum, page 1"
  curl -s "https://data.usajobs.gov/api/jobs?Series=$snum&NumberOfJobs=250&Page=1" -o "$td/$snum-1.json"

  # now parse the first page to find the number of pages
  # remaining
  total_pages=$(cat "$td/$snum-1.json" | jq -r '.Pages')
  # if $total_pages is less than 2, this for-loop doesn't
  # execute
  for p in $(seq 2 $total_pages); do
    echo "Fetching series $snum, page $p"
    curl -s "https://data.usajobs.gov/api/jobs?Series=$snum&NumberOfJobs=250&Page=$p" -o "$td/$snum-$p.json"
  done
done

analyzer.sh

datadir="./data-hold/scrapes/$1"
yearly_jobs=$(cat $datadir/*.json | jq '.JobData[] | select(.SalaryBasis == "Per Year")')
# trimming the data into simple_rows 
# is actually probably unnecessary, but it doesn't hurt
simple_rows=$(echo $yearly_jobs | jq '. | {JobTitle, SalaryMin, SalaryMax}')

# easier to read left-to-right pipe notation than what I proposed above...but
# the effect is the same
echo "$simple_rows" | jq -r '.JobTitle' | sort | uniq -c | \
 sort -rn | head -n 25 | \
while read -r line; do 
  title=$(echo "$line" | grep -oE '[[:alpha:]].+')
  filtered_rows=$(echo "$simple_rows" | jq "select(.JobTitle == \"$title\")" )
  min=$(echo "$filtered_rows" | jq -r '.SalaryMin' | tr -d '$' | tr -d ',' | sort -n | head -n 1)
  max=$(echo "$filtered_rows" | jq -r '.SalaryMax' | tr -d '$' | tr -d ',' | sort -rn | head -n 1)
  # print the pipe-delimited row: title|min|max
  echo "$title|$min|$max"
done