Most of the popular API and data services use the JSON data format, so we'll learn how it's used to serialize interesting information, and how to use the jq to parse it at the command-line.
JSON stands for JavaScript Object Notation and is nearly ubiquitous as a data format, for its lightweight nature and (relatively) ease of human-readability.
jq is a program described as "sed
for JSON data":
You can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
If you've haven't installed the pup HTML parsing tool, or any other tool which required you to edit your ~/.bashrc
's' PATH
and create the ~/bin_compciv
directory, then follow the instructions here: Installing programs for your personal Stanford account.
To install jq:
# download directly into ~/bin_compciv
curl http://stedolan.github.io/jq/download/linux64/jq -o ~/bin_compciv/jq
# give it executable permissions
chmod a+x ~/bin_compciv/jq
Check out the tutorial on jq here. The full manual can be found here.
JSON is more verbose, but easier to read than CSV.
CSV:
firstname,lastname,middlename,birthyear,deathyear,gender,city,state
Alexander,Smith,Johnson,1918,1955,M,Madison,Wisconsin
Lucy Jo Mary,Carrie-Paulson,,1967,,F,Omaha,Nebraska
Jeremiah,"Springfield, Jr.",Jebediah,1942,2003,M,Kalamazoo,Michigan
JSON:
[
{
"firstname": "Alexander",
"lastname": "Smith",
"middlename": "Johnson",
"birthyear": 1918,
"deathyear": 1955,
"gender": "M",
"city": "Madison",
"state": "Wisconsin"
},
{
"firstname": "Lucy Jo Mary",
"lastname": "Carrie-Paulson",
"middlename": null,
"birthyear": 1967,
"deathyear": null,
"gender": "F",
"city": "Omaha",
"state": "Nebraska"
},
{
"firstname": "Jeremiah",
"lastname": "Springfield, Jr.",
"middlename": "Jebediah",
"birthyear": 1942,
"deathyear": 2003,
"gender": "M",
"city": "Kalamazoo",
"state": "Michigan"
}
]
Inherent to the JSON data format is the ability to encapsulate data structures known as Objects consisting of attribute-value pairs
The object below has the attributes first_name
and last_name
, with the values of John
and Smith
, respectively:
{
"first_name": "John",
"last_name": "Smith"
}
JSON objects can also contain Arrays, which can be thought of as lists of elements. Below, the things_carried
attribute has an Array containing apples
, hat
, and harmonica
.
{
"first_name": "John",
"last_name": "Smith",
"things_carried": [
"apples",
"hat",
"harmonica"
]
}
What makes JSON more flexible than flat data structures (such as CSV and other spreadsheet-type formats) is the ability to nest objects and arrays:
{
"first_name": "John",
"last_name": "Smith",
"things_carried": [
"apples",
"hat",
"harmonica"
],
"children": [
{
"first_name": "Bobby Sue",
"last_name": "Smith"
},
{
"first_name": "John Jr",
"last_name": "Smith"
}
]
}
The unitedstates/congress-legislators Github repo contains a wealth of useful, serialized information about the U.S. Congress. From their current legislators file, I've extracted the entry for newly-elected Senator Joni Ernst and converted it to a JSON, which you can download here.
user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/joni-ernst.json' > ernst.json
user@host:~$ cat ernst.json | jq '.'
The formatted output:
{
"id": {
"bioguide": "E000295",
"thomas": "02283",
"fec": [
"S4IA00129"
],
"govtrack": 412667,
"opensecrets": "N00035483",
"lis": "S376"
},
"name": {
"first": "Joni",
"last": "Ernst",
"official_full": "Joni Ernst"
},
"bio": {
"gender": "F",
"birthday": "1970-07-01"
},
"terms": [
{
"type": "sen",
"start": "2015-01-06",
"end": "2021-01-03",
"state": "IA",
"class": 2,
"state_rank": "junior",
"party": "Republican",
"url": "http://www.ernst.senate.gov",
"address": "825 B&C Hart Senate Office Building Washington DC 20510",
"office": "825 B&c Hart Senate Office Building",
"phone": "202-224-3254"
}
]
}
Senator Ernst is new to Congress (and the first woman to represent Iowa in Congress), so her JSON-object representation is brief enough for now to read on one page. The object includes 4 nested objects, under the attributes, id
, name
, bio
, terms
.
Use jq to view each of these attributes separately, by prepending a dot to the desired attribute:
user@host:~$ cat ernst.json | jq '.name'
{
"first": "Joni",
"last": "Ernst",
"official_full": "Joni Ernst"
}
user@host:~$ cat ernst.json | jq '.id'
{
"bioguide": "E000295",
"thomas": "02283",
"fec": [
"S4IA00129"
],
"govtrack": 412667,
"opensecrets": "N00035483",
"lis": "S376"
}
To access a nested attribute, for example, the first
part of the name
object, simply list the attributes in sequence:
user@host:~$ cat ernst.json | jq '.name.first'
"Joni"
user@host:~$ cat ernst.json | jq '.bio .birthday'
"1970-07-01"
Think of an array as a list. In JSON, square brackets are used to enclose an array of objects or values. Below, the attribute fruits points to an array of string values, "apples"
, "oranges"
, "pears"
:
{
"fruits": [
"apples",
"oranges",
"pears"
]
}
In the case of the Congressmember JSON, each member has a terms
attribute, pointing to an array of objects that look like this:
{
"terms": [
{
"type": "sen",
"start": "2015-01-06",
"end": "2021-01-03",
"state": "IA"
/* ... */
}
]
}
Here, we use jq's notation to access Senator Ernst's terms
as an attribute: take care to notice the square brackets which denote the array that is contained in terms
:
user@host:~$ cat ernst.json | jq '.terms'
[
{
"type": "sen",
"start": "2015-01-06",
"end": "2021-01-03",
"state": "IA",
"class": 2,
"state_rank": "junior",
"party": "Republican",
"url": "http://www.ernst.senate.gov",
"address": "825 B&C Hart Senate Office Building Washington DC 20510",
"office": "825 B&c Hart Senate Office Building",
"phone": "202-224-3254"
}
]
Because Senator Ernst has so far served only one term, there is only one object in the .terms
array. However, trying to access the start
attribute, using just the dot notation for attributes, will result in an error:
user@host:~$ cat ernst.json | jq '.terms .start'
jq: error: Cannot index array with string
The individual objects in an array are indexed by number, starting from zero. In other words, if we want to get the start attribute for Senator Ernst's 1st term, we tell jq that we want the 0th element, and we use the square bracket notation when accessing an array, in this case, the terms
array.
In the result below, take particular note of the lack of square brackets: by specifying a specific element in terms
, we get a single object in return:
user@host:~$ cat ernst.json | jq '.terms[0]'
{
"type": "sen",
"start": "2015-01-06",
"end": "2021-01-03",
"state": "IA",
"class": 2,
"state_rank": "junior",
"party": "Republican",
"url": "http://www.ernst.senate.gov",
"address": "825 B&C Hart Senate Office Building Washington DC 20510",
"office": "825 B&c Hart Senate Office Building",
"phone": "202-224-3254"
}
Now try to retrieve the start
attribute:
user@host:~$ cat ernst.json | jq '.terms[0] .start'
"2015-01-06"
Try accessing the 1st element in terms
; if Senator Ernst has only served one term, and that term is accessible via the 0th element, think to yourself what the 1st element in the terms
array will contain:
user@host:~$ cat ernst.json | jq '.terms[1] .start'
null
The terms
array for Sen. Ernst is still an array, even with one object. But let's see how an array of multiple objects works: we can use Rep. Nancy Pelosi as an example, as she's been in Congress for multiple terms:
user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/nancy-pelosi.json' > pelosi.json
# inspect Rep. Pelosi's data object
user@host:~$ cat pelosi.json | jq '.'
How many terms has Pelosi served, exactly? There may be a specific filter in jq that returns a count, but let's just use the good ol' wc tool and pipe the result of jq into it:
# re-familiarize yourself with what this produces:
user@host:~$ cat pelosi.json | jq '.terms'
# Notice the subtle difference between that output and the
# output of this:
user@host:~$ cat pelosi.json | jq '.terms[]'
# Think about what this should return as output before
# running it:
user@host:~$ cat pelosi.json | jq '.terms[] .start'
# And now we pipe to wc -l
user@host:~$ cat pelosi.json | jq '.terms[] .start' | wc -l
15
You can continue to practice your JSON selectors. Or just move on to the entire Congress, because what we've learned so far should apply to all Congressmember objects.
user@host:~$ curl 'http://stash.compciv.org/congress-twitter/json/legislators-current.json' > legislators.json
With the JSON files for Rep. Pelosi and Sen. Ernst, we were dealing with single Congressmember objects. This legislators.json
contains an array of those Congressmember objects. So this query to find the name
attribute will fail:
user@host:~$ cat legislators.json | jq '.name'
jq: error: Cannot index array with string
So we have to use the array selector (the []
) to access individual objects. To find the name
object of the first Congressmember object (again, remember that elements in the array are numbered starting with 0
):
user@host:~$ cat legislators.json | jq '.[0] .name'
{
"first": "Sherrod",
"last": "Brown",
"official_full": "Sherrod Brown"
}
Keep playing with jq and check out its manual for more examples. We've just gone over some of the concepts on how JSON is structured. In later exercises, we'll use some of jq's more advanced features.
Just another example of JSON data to parse:
Check out Github's status page, which has some nice visualizations about their current uptime and response speed.
Here's the API and JSON format that is behind that page: status.github.com/api
To get the current status as raw JSON:
user@host:~$ curl -s https://status.github.com/api/status.json
{"status":"good","last_updated":"2015-01-21T16:58:42Z"}
Passing it through jq's parser, with the most basic option, will make it look nicer (with colors):
user@host:~$ curl -s https://status.github.com/api/status.json |\
jq '.'
{
"status": "good",
"last_updated": "2015-01-21T16:59:01Z"
}
Do some basic selection:
user@host:~$ curl -s https://status.github.com/api/status.json |\
jq '.status'
"good"
Select both elements:
curl -s https://status.github.com/api/status.json |\
> jq '.status,.last_updated'
"good"
"2015-01-21T17:02:57Z"
The jq tool outputs formatted text that looks nice, but often we just need it in a raw form that can be filtered into our other command-line tools – notice how the --raw-output
option removes the quotation marks from the output:
user@host:~$ curl -s https://status.github.com/api/status.json |\
jq --raw-output '.status,.last_updated'
good
2015-01-21T17:02:57Z
As per the jq tutorial, you can take a look at what Github records when you push your commits to your compciv
repo:
# Just the raw data
user@host:~$ mygiturl='https://api.github.com/repos/hello-stanford/compciv/commits?per_page=1'
user@host:~$ curl -s $mygiturl
# pass it into jq for nicer formatting
user@host:~$ curl -s $mygiturl | jq '.'
If you're really that interested in Github's API and its other endpoints, check out the documentation here.