Parsing JSON with jq

JSON is a lightweight format that is nearly ubiquitous for data-exchange. jq is a command-line tool for parsing JSON.

Most of the popular API and data services use the JSON data format, so we'll learn how it's used to serialize interesting information, and how to use the jq to parse it at the command-line.

JSON - a lightweight data format

JSON stands for JavaScript Object Notation and is nearly ubiquitous as a data format, for its lightweight nature and (relatively) ease of human-readability.

Using jq

jq is a program described as "sed for JSON data":

You can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

Installing jq

If you've haven't installed the pup HTML parsing tool, or any other tool which required you to edit your ~/.bashrc's' PATH and create the ~/bin_compciv directory, then follow the instructions here: Installing programs for your personal Stanford account.

To install jq:

# download directly into ~/bin_compciv
curl http://stedolan.github.io/jq/download/linux64/jq -o ~/bin_compciv/jq
# give it executable permissions
chmod a+x ~/bin_compciv/jq

Check out the tutorial on jq here. The full manual can be found here.

JSON compared to CSV

JSON is more verbose, but easier to read than CSV.

CSV:

firstname,lastname,middlename,birthyear,deathyear,gender,city,state
Alexander,Smith,Johnson,1918,1955,M,Madison,Wisconsin
Lucy Jo Mary,Carrie-Paulson,,1967,,F,Omaha,Nebraska
Jeremiah,"Springfield, Jr.",Jebediah,1942,2003,M,Kalamazoo,Michigan

JSON:

[
  {
    "firstname": "Alexander",
    "lastname": "Smith",
    "middlename": "Johnson",
    "birthyear": 1918,
    "deathyear": 1955,
    "gender": "M",
    "city": "Madison",
    "state": "Wisconsin"
  },
  {
    "firstname": "Lucy Jo Mary",
    "lastname": "Carrie-Paulson",
    "middlename": null,
    "birthyear": 1967,
    "deathyear": null,
    "gender": "F",
    "city": "Omaha",
    "state": "Nebraska"
  },
  {
    "firstname": "Jeremiah",
    "lastname": "Springfield, Jr.",
    "middlename": "Jebediah",
    "birthyear": 1942,
    "deathyear": 2003,
    "gender": "M",
    "city": "Kalamazoo",
    "state": "Michigan"
  }
]

Objects and attribute-value pairs

Inherent to the JSON data format is the ability to encapsulate data structures known as Objects consisting of attribute-value pairs

The object below has the attributes first_name and last_name, with the values of John and Smith, respectively:

{
  "first_name": "John",
  "last_name": "Smith"
}

JSON objects can also contain Arrays, which can be thought of as lists of elements. Below, the things_carried attribute has an Array containing apples, hat, and harmonica.

{
  "first_name": "John",
  "last_name": "Smith",
  "things_carried": [
    "apples",
    "hat",
    "harmonica"
  ]
}

What makes JSON more flexible than flat data structures (such as CSV and other spreadsheet-type formats) is the ability to nest objects and arrays:

{
  "first_name": "John",
  "last_name": "Smith",
  "things_carried": [
    "apples",
    "hat",
    "harmonica"
  ],
  "children": [
    {
      "first_name": "Bobby Sue",
      "last_name": "Smith"
    },
    {
      "first_name": "John Jr",
      "last_name": "Smith"
    }
  ]
}

Sen. Joni Ernst as a JSON object

The unitedstates/congress-legislators Github repo contains a wealth of useful, serialized information about the U.S. Congress. From their current legislators file, I've extracted the entry for newly-elected Senator Joni Ernst and converted it to a JSON, which you can download here.

user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/joni-ernst.json' > ernst.json
user@host:~$ cat ernst.json | jq '.'

The formatted output:

{
  "id": {
    "bioguide": "E000295",
    "thomas": "02283",
    "fec": [
      "S4IA00129"
    ],
    "govtrack": 412667,
    "opensecrets": "N00035483",
    "lis": "S376"
  },
  "name": {
    "first": "Joni",
    "last": "Ernst",
    "official_full": "Joni Ernst"
  },
  "bio": {
    "gender": "F",
    "birthday": "1970-07-01"
  },
  "terms": [
    {
      "type": "sen",
      "start": "2015-01-06",
      "end": "2021-01-03",
      "state": "IA",
      "class": 2,
      "state_rank": "junior",
      "party": "Republican",
      "url": "http://www.ernst.senate.gov",
      "address": "825 B&C Hart Senate Office Building Washington DC 20510",
      "office": "825 B&c Hart Senate Office Building",
      "phone": "202-224-3254"
    }
  ]
}

Senator Ernst is new to Congress (and the first woman to represent Iowa in Congress), so her JSON-object representation is brief enough for now to read on one page. The object includes 4 nested objects, under the attributes, id, name, bio, terms.

Use jq to view each of these attributes separately, by prepending a dot to the desired attribute:

user@host:~$ cat ernst.json | jq '.name'
{
  "first": "Joni",
  "last": "Ernst",
  "official_full": "Joni Ernst"
}


user@host:~$ cat ernst.json | jq '.id'
{
  "bioguide": "E000295",
  "thomas": "02283",
  "fec": [
    "S4IA00129"
  ],
  "govtrack": 412667,
  "opensecrets": "N00035483",
  "lis": "S376"
}

Nested objects

To access a nested attribute, for example, the first part of the name object, simply list the attributes in sequence:

user@host:~$ cat ernst.json | jq '.name.first'
"Joni"
user@host:~$ cat ernst.json | jq '.bio .birthday'
"1970-07-01"

Arrays

Think of an array as a list. In JSON, square brackets are used to enclose an array of objects or values. Below, the attribute fruits points to an array of string values, "apples", "oranges", "pears":

{
  "fruits": [
    "apples",
    "oranges",
    "pears"
  ]
}

In the case of the Congressmember JSON, each member has a terms attribute, pointing to an array of objects that look like this:

{
  "terms": [
    {
      "type": "sen",
      "start": "2015-01-06",
      "end": "2021-01-03",
      "state": "IA"
      /* ... */
    }
  ]
}

Here, we use jq's notation to access Senator Ernst's terms as an attribute: take care to notice the square brackets which denote the array that is contained in terms:

user@host:~$ cat ernst.json | jq '.terms'
[
  {
    "type": "sen",
    "start": "2015-01-06",
    "end": "2021-01-03",
    "state": "IA",
    "class": 2,
    "state_rank": "junior",
    "party": "Republican",
    "url": "http://www.ernst.senate.gov",
    "address": "825 B&C Hart Senate Office Building Washington DC 20510",
    "office": "825 B&c Hart Senate Office Building",
    "phone": "202-224-3254"
  }
]

Because Senator Ernst has so far served only one term, there is only one object in the .terms array. However, trying to access the start attribute, using just the dot notation for attributes, will result in an error:

user@host:~$ cat ernst.json | jq '.terms .start'
jq: error: Cannot index array with string

Accessing objects in an array

The individual objects in an array are indexed by number, starting from zero. In other words, if we want to get the start attribute for Senator Ernst's 1st term, we tell jq that we want the 0th element, and we use the square bracket notation when accessing an array, in this case, the terms array.

In the result below, take particular note of the lack of square brackets: by specifying a specific element in terms, we get a single object in return:

user@host:~$  cat ernst.json | jq '.terms[0]'
{
  "type": "sen",
  "start": "2015-01-06",
  "end": "2021-01-03",
  "state": "IA",
  "class": 2,
  "state_rank": "junior",
  "party": "Republican",
  "url": "http://www.ernst.senate.gov",
  "address": "825 B&C Hart Senate Office Building Washington DC 20510",
  "office": "825 B&c Hart Senate Office Building",
  "phone": "202-224-3254"
}

Now try to retrieve the start attribute:

user@host:~$ cat ernst.json | jq '.terms[0] .start'
"2015-01-06"

Try accessing the 1st element in terms; if Senator Ernst has only served one term, and that term is accessible via the 0th element, think to yourself what the 1st element in the terms array will contain:

user@host:~$ cat ernst.json | jq '.terms[1] .start'
null

Rep. Nancy Pelosi as a JSON object

The terms array for Sen. Ernst is still an array, even with one object. But let's see how an array of multiple objects works: we can use Rep. Nancy Pelosi as an example, as she's been in Congress for multiple terms:

user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/nancy-pelosi.json' > pelosi.json
# inspect Rep. Pelosi's data object
user@host:~$ cat pelosi.json | jq '.'

How many terms has Pelosi served, exactly? There may be a specific filter in jq that returns a count, but let's just use the good ol' wc tool and pipe the result of jq into it:

# re-familiarize yourself with what this produces:
user@host:~$ cat pelosi.json | jq '.terms'
# Notice the subtle difference between that output and the 
# output of this:
user@host:~$ cat pelosi.json | jq '.terms[]'
# Think about what this should return as output before 
# running it:
user@host:~$ cat pelosi.json | jq '.terms[] .start' 
# And now we pipe to wc -l 
user@host:~$ cat pelosi.json | jq '.terms[] .start' | wc -l 
      15

All of Congress

You can continue to practice your JSON selectors. Or just move on to the entire Congress, because what we've learned so far should apply to all Congressmember objects.

user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/legislators-current.json' > legislators.json

With the JSON files for Rep. Pelosi and Sen. Ernst, we were dealing with single Congressmember objects. This legislators.json contains an array of those Congressmember objects. So this query to find the name attribute will fail:

user@host:~$ cat legislators.json | jq '.name'
jq: error: Cannot index array with string

So we have to use the array selector (the []) to access individual objects. To find the name object of the first Congressmember object (again, remember that elements in the array are numbered starting with 0):

user@host:~$ cat legislators.json | jq '.[0] .name'
{
  "first": "Sherrod",
  "last": "Brown",
  "official_full": "Sherrod Brown"
}

Keep playing with jq and check out its manual for more examples. We've just gone over some of the concepts on how JSON is structured. In later exercises, we'll use some of jq's more advanced features.

Testing the Github API

Just another example of JSON data to parse:

Status endpoint

Check out Github's status page, which has some nice visualizations about their current uptime and response speed.

Here's the API and JSON format that is behind that page: status.github.com/api

To get the current status as raw JSON:

user@host:~$ curl -s https://status.github.com/api/status.json
{"status":"good","last_updated":"2015-01-21T16:58:42Z"}

Passing it through jq's parser, with the most basic option, will make it look nicer (with colors):

user@host:~$ curl -s https://status.github.com/api/status.json |\
  jq '.'
{
  "status": "good",
  "last_updated": "2015-01-21T16:59:01Z"
}

Do some basic selection:

user@host:~$ curl -s https://status.github.com/api/status.json |\
  jq '.status'
"good"

Select both elements:

curl -s https://status.github.com/api/status.json |\
>   jq '.status,.last_updated'
"good"
"2015-01-21T17:02:57Z"

The jq tool outputs formatted text that looks nice, but often we just need it in a raw form that can be filtered into our other command-line tools – notice how the --raw-output option removes the quotation marks from the output:

user@host:~$ curl -s https://status.github.com/api/status.json |\
  jq --raw-output '.status,.last_updated'
good
2015-01-21T17:02:57Z

Commits endpoint

As per the jq tutorial, you can take a look at what Github records when you push your commits to your compciv repo:

# Just the raw data
user@host:~$ mygiturl='https://api.github.com/repos/hello-stanford/compciv/commits?per_page=1'
user@host:~$ curl -s $mygiturl

# pass it into jq for nicer formatting
user@host:~$ curl -s $mygiturl | jq '.'

If you're really that interested in Github's API and its other endpoints, check out the documentation here.