Exploring Congressional Twitter data as JSON

Basic JSON parsing exercise using what Congress tweets.

Due: Tuesday, January 27
Points: 5

An exercise in JSON parsing and understanding object and array data structures.

Deliverables

  • A folder named "congress-twitter" in your Github repo

  • The 'jq-practice.sh' script

    This single script, jq-practice.sh, when run, will simply print to standard output the answers to every question asked in the assignment, in this format:

        1.
        "Brown"
        2.
        "D000604"
        3.
        "2001-01-03"
        4.
        "Robert"
        "Casey"
        "1960-04-13"
        5.
    

    So yes, it will print out a lot of lines when finished. You can see the full answers at the bottom. So basically, adjust your script until it matches the given output.

  • Think of this has a quiz about JSON and jq, while letting you see an example of how Congressional and Twitter data is structured.

    Make sure you have jq installed and know the basics of parsing JSON with it. Some of the problems below will cover some advanced jq usage, which you can find in the manual. However, some of the problems have already been done for you, and none of the unfinished problems requires knowing syntax than what was either in the JSON tutorial or in one of the already finished problems.

    Even though this homework calls for one script, jq-practice.sh, to contain all the answers, you should do the problems one at a time. When you're done with all of them, you then just stuff it into jq-practice.sh and it should produce the output given at the bottom of this page.

    Installation data

    You don't have to create a helper.sh, just run these steps (assuming you've already created a congress-twitter homework folder)

    cd ~/compciv/homework/congress-twitter
    mkdir data-hold
    cd data-hold
    curl -o x.zip http://stash.compciv.org/congress-twitter/json/jq-congress-twitter-exercise.zip
    unzip x.zip
    rm -f x.zip
    cd ..
    

    The questions

    Questions marked with asterisks are already done for you. Just copy them into your jq-practice.sh script (in order). And try out the commands before moving on, as they almost always relate to other problems in this quiz.

    1. * In legislators-current.json, print the last name of the first Congressmember listed in the array

    Remember that the file has an array of legislators. And that the first member of an array is at the 0th position.

    And the output of jq is just text that we can filter with all the other tools we've used before. If you had wanted to remove the quotation marks, you could pipe the result into tr -d '"'

    echo '1.'
    cat data-hold/legislators-current.json | jq '.[0] .name .last'
    
    2. In legislators-current.json, print the bioguide id of the 112th Congressmember

    Same pattern as the problem above.

    echo '2.'
    # you have to do the rest yourself!
    
    3. * In legislators-current.json, print the start date of the 3rd term of the 14th-listed Congressmember

    Here is an example of accessing an array with an array. Inspect the 14th-element with jq before looking at .terms[], and then at .terms[2], and then finally, with .start, to see how the selectors narrow down the selected data.

    echo '3.'
    cat data-hold/legislators-current.json | jq '.[13] .terms[2] .start'
    
    4. * In legislators-current.json, print the first name, last name, and the birthdate date of the 5th-listed congressmember (one item on each line is fine)

    Notice the use of the pipe inside the string that's passed into jq. This is a pipe particular to jq…so the filtering is happening inside of jq.

    The below command can be read as: Take the fifth member of the index and pipe it into the selectors of .name .first, .name .last, and .bio .birthday. The key here is to just pay attention to how jq has its own syntax…like Bash, it is itself an interpreter program.

    echo '4.'
    cat data-hold/legislators-current.json | \
      jq '.[4] | .name .first, .name .last, .bio .birthday'
    
    5. * In legislators-current.json, print the birthdates of the 10 youngest congressmembers
    cat data-hold/legislators-current.json | \
      jq '.[] .bio .birthday' | sort -r | head -n 10 | tr -d '"'
    
    6. In legislators-current.json, print the list of religious affiliations (including null), in descending-order of frequency
    echo '6.'
    # do the rest yourself!
    
    7. * In legislators-current.json, Print the number of incoming Congressmembers

    The --raw-output flag of jq can also be used to remove extraneous quotation marks and the other formatting that jq applies.

    The grep here assumes that terms[0] contains the very first term of every Congressmember, and if the start date for that term is in 2015, then it is a freshman congressmember.

    echo '7.'
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[0] .start' | sort | \
      cut -d '-' -f 1 | grep -c '2015'
    
    8. In legislators-current.json, Print the start year of the longest-serving Congressmember
    echo '8.'
    # do the rest yourself
    
    9. * In legislators-current.json, print the number of Republican congressmembers

    The special notation, .terms[-1:0], let's us choose a range of items in an array, which will produce another array, i.e. a subarray. In this case, by using a negative number, I'm counting backwards. And then I use another brace notation, [0], to pick the first element of this subarray.

    The bigger picture here is that Congressmembers don't simply have one party. Some of them may have chosen to switch parties at some point.

    echo '9.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[-1:][0].party' | grep -c 'Republican'
    
    10. In legislators-current.json, print the number of Congressmembers who are neither Republican nor Democratic
    echo '10.'
    # do the rest yourself 
    
    11. * In legislators-current.json, get the list of all .rss_url entries, with duplicates removed, and in alphabetical order, and print just the first 10 lines:

    The value null is used in JSON to indicate a missing value:

    echo '11.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[] .rss_url' | grep -v 'null' | sort | \
        uniq | head -n 10
    
    12. * In legislators-social-media.json, print the number of Twitter accounts

    Note that we've switched to new JSON file now.

    echo '12.'
    
    cat data-hold/legislators-social-media.json | jq '.[] .social .twitter' | grep -v 'null' | wc -l
    
    13. In legislators-social-media.json, print the number of Facebook accounts

    Same as above, but looking for Facebook

    echo '13.'
    
    14. * In legislators-social-media.json, print the first 10 lines of a comma-separated list of all bioguideIDs, along with Twitter screen names (if they have one), in alphabetical order of Twitter screen names

    This problem shows off a feature of jq: Array construction, in which we can specify values to be wrapped up in an Array. Which we then pass into the

    Also, notice how within the string that's passed into jq as an argument, that the pipe filter is being used twice. This is not a bash pipe (again, it's within the call to jq)…however, it functions pretty much the same way: we pipe the input from .[] into the Array constructor, and then into @csv, which produces a comma-separated list.

    If you don't get it all, don't worry, but just pay attention to how the argument to jq has its own special syntax, and pipes, that it uses to process data, which then gets passed into the other Bash filter (head -n 10)

    echo '14.'
    cat data-hold/legislators-social-media.json | \
      jq --raw-output '.[]  | [.id .bioguide, .social .twitter] | @csv' \
      | head -n 10
    
    15. In legislators-social-media.json, repeat Step 14, but filter it to print only the lines that do not have a Twitter account listed

    This is the same code as previous, but just filter it through something (like a grep) and then into head

    echo '15.'
    # finish this
    
    16. * In congress-twitter-profiles.json, print the number of verified Twitter accounts.

    Note that we're using a different data file now, congress-twitter-profiles.json

    echo '16.'
    cat data-hold/congress-twitter-profiles.json | \
      jq '.[] .verified' | grep 'true' | wc -l
    
    17. In congress-twitter-profiles.json, print the largest follower count

    Similar to what you've done before, piping output into sort -n. This just requires you to get a little acquainted with the Twitter profile data structure.

    echo '17.'
    # finish this
    
    18. * In congress-twitter-profiles.json, print the first 10 lines of: CSV of name, screen_name, followers_count, verified, and created_at (when the account was created):

    This is just another example of the syntax of jq, letting us pipe into its @csv filter, to output data that is in CSV format:

    echo '18.'
    cat data-hold/congress-twitter-profiles.json | \
      jq --raw-output '.[] | [.name, .screen_name, .followers_count, .verified, .created_at] | @csv' |\
      head -n 10
    
    19. In congress-twitter-profiles.json, print the first 10 lines of: a CSV of screen_name, statuses_count, followers_count, and the created_at of their most recent tweet

    This is nearly the exact same as Problem 18, with a couple of different fields being selected.

    (The phrasing "their most recent tweet" might confusingly imply something about the data structure: check the Twitter API docs for users/show for clarification)

    echo '19.'
    

    The answers

    The output of bash jq-practice.sh should look exactly like this:

    1.
    "Brown"
    2.
    "D000604"
    3.
    "2001-01-03"
    4.
    "Robert"
    "Casey"
    "1960-04-13"
    5.
    1984-07-02
    1983-03-30
    1981-05-28
    1981-04-12
    1980-11-16
    1980-10-04
    1980-06-16
    1980-04-18
    1980-03-01
    1980-01-30
    6.
     355 null
      42 Roman Catholic
      24 Baptist
      19 Methodist
      15 Jewish
      15 Episcopalian
      15 Catholic
      14 Presbyterian
       7 Latter Day Saints
       6 Protestant
       6 Lutheran
       5 Unknown
       5 Christian
       3 African Methodist Episcopal
       2 Southern Baptist
       2 Christian Scientist
       1 United Methodist
       1 Seventh Day Adventist
       1 Nazarene
       1 Episcopal
       1 Church of Christ
    7.
    64
    8.
    1965
    9.
    301
    10.
    3
    11.
    feed://olson.house.gov/common/rss/?rss=82
    http://aderholt.house.gov/common/rss//index.cfm?rss=20
    http://adriansmith.house.gov/rss.xml
    http://algreen.house.gov/rss.xml
    http://amash.house.gov/rss.xml
    http://amodei.house.gov/common/rss//?rss=49
    http://austinscott.house.gov/index.php?format=feed&type=rss
    http://ayotte.senate.gov/rss/?p=news
    http://barletta.house.gov/common/rss//index.cfm?rss=25
    http://barr.house.gov/rss.xml
    12.
         490
    13.
         439
    14.
    "Y000064","RepToddYoung"
    "Y000063","RepKevinYoder"
    "Y000062","RepJohnYarmuth"
    "Y000033","RepDonYoung"
    "W000809","Rep_SteveWomack"
    "W000808","RepWilson"
    "W000806","RepWebster"
    "W000805","MarkWarner"
    "W000804","RobWittman"
    "W000802","SenWhitehouse"
    15.
    "S001177",
    "M001144",
    "M000689",
    "D000604",
    "C001049",
    "C001037",
    "A000368",
    "B001245",
    "G000546",
    "W000810",
    "V000127",
    "C001075",
    "F000457",
    "K000367",
    "G000555",
    "G000556",
    16.
         463
    17.
    1938080
    18.
    "Aaron Schock","aaronschock",25383,true,"Thu Mar 12 14:04:15 +0000 2009"
    "Ander Crenshaw","AnderCrenshaw",8229,true,"Fri Feb 06 01:48:11 +0000 2009"
    "Rep. Austin Scott","AustinScottGA08",7326,true,"Thu Jan 06 16:01:46 +0000 2011"
    "RepBThompson","BennieGThompson",4401,true,"Wed Oct 14 20:47:03 +0000 2009"
    "Betty McCollum","BettyMcCollum04",5784,true,"Tue Mar 06 20:52:11 +0000 2012"
    "Bill Pascrell, Jr.","BillPascrell",4201,true,"Tue Sep 15 17:54:32 +0000 2009"
    "Blumenauer Media","BlumenauerMedia",2732,true,"Tue Apr 27 21:48:06 +0000 2010"
    "U.S. Rep. Bob Latta","boblatta",14499,true,"Fri Jul 11 18:33:24 +0000 2008"
    "Brad Sherman","BradSherman",7093,true,"Fri Apr 10 13:38:03 +0000 2009"
    "Dutch Ruppersberger","Call_Me_Dutch",4983,true,"Thu May 26 14:39:08 +0000 2011"
    19.
    "aaronschock",2502,25383,"Wed Jan 21 02:59:48 +0000 2015"
    "AnderCrenshaw",827,8229,"Tue Jan 20 21:39:51 +0000 2015"
    "AustinScottGA08",1388,7326,"Wed Jan 21 02:21:25 +0000 2015"
    "BennieGThompson",417,4401,"Wed Jan 21 02:52:33 +0000 2015"
    "BettyMcCollum04",2449,5784,"Wed Jan 21 03:45:08 +0000 2015"
    "BillPascrell",803,4201,"Tue Jan 13 16:05:11 +0000 2015"
    "BlumenauerMedia",356,2732,"Wed Jul 30 20:27:37 +0000 2014"
    "boblatta",4865,14499,"Wed Jan 21 03:48:27 +0000 2015"
    "BradSherman",516,7093,"Wed Jan 21 02:40:53 +0000 2015"
    "Call_Me_Dutch",1584,4983,"Tue Jan 20 20:51:43 +0000 2015"
    

    Solution

    # 1. * In `legislators-current.json`, print the last name of the __first__ Congressmember listed in the array
    echo '1.'
    cat data-hold/legislators-current.json | jq '.[0] .name .last'
    
    # 2. In `legislators-current.json`, print the bioguide id of the 112th Congressmember
    echo '2.'
    cat data-hold/legislators-current.json | jq '.[111] .id .bioguide'
    
    
    # 3. In `legislators-current.json`, print the start date of the 3rd term of the 14th-listed Congressmember
    echo '3.'
    cat data-hold/legislators-current.json | jq '.[13] .terms[2] .start'
    
    
    # 4. * In `legislators-current.json`, print the first name, last name, and the birthdate date of the 5th-listed congressmember (one item on each line is fine) 
    echo '4.'
    cat data-hold/legislators-current.json | \
      jq '.[4] | .name .first, .name .last, .bio .birthday'
    
    # 5. * In `legislators-current.json`, print the birthdates of the 10 youngest congressmembers
    echo '5.'
    
    cat data-hold/legislators-current.json | \
      jq '.[] .bio .birthday' | sort -r | head -n 10 | tr -d '"'
    
    # 6. In `legislators-current.json`, print the list of religious affiliations (including `null`), in descending-order of frequency
    echo '6.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .bio .religion' | sort | uniq -c | sort -r
    
    
    # 7. * In `legislators-current.json`, Print the number of incoming Congressmembers
    echo '7.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[0] .start' | sort | cut -d '-' -f 1 | grep -c '2015'
    
    
    # 8. In `legislators-current.json`, Print the start year of the longest-serving Congressmember
    echo '8.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[0] .start' | sort | cut -d '-' -f 1 | head -n 1
    
    
    # 9. * In `legislators-current.json`, print the number of Republican congressmembers
    echo '9.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[-1:][0].party' | grep -c 'Republican'
    
    # 10. In `legislators-current.json`, print the number of Congressmembers who are neither Republican nor Democratic
    echo '10.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[-1:][0].party' | grep -cvE 'Republican|Democrat' 
    
    # 11. * In `legislators-current.json`, print the first 10 lines of the list of  
    #   all .rss_url entries, with duplicates removed, and in alphabetical order
    echo '11.'
    
    cat data-hold/legislators-current.json | \
      jq --raw-output '.[] .terms[] .rss_url' | grep -v 'null' | sort | uniq | head -n 10
    
    # 12. * In `legislators-social-media.json`, print the number of Twitter accounts
    echo '12.'
    
    cat data-hold/legislators-social-media.json | jq '.[] .social .twitter' | grep -v 'null' | wc -l
    
    # 13. In `legislators-social-media.json`, print the number of Facebook accounts
    echo '13.'
    
    cat data-hold/legislators-social-media.json | jq '.[] .social .facebook' | grep -v 'null'  | wc -l
    
    # 14. * In `legislators-social-media.json`, * Print the first 10 lines of a comma-separated list of all bioguideIDs, along with Twitter screen names (if they have one), in alphabetical order of Twitter screen names
    echo '14.'
    cat data-hold/legislators-social-media.json | \
      jq --raw-output '.[]  | [.id .bioguide, .social .twitter] | @csv' | head -n 10
    
    
    # 15. In `legislators-social-media.json`, repeat Step 14, but filter it to print only the lines that do not have a Twitter account listed
    echo '15.'
    cat data-hold/legislators-social-media.json | \
      jq --raw-output '.[]  | [.id .bioguide, .social .twitter] | @csv' | grep -v ',"'
    
    # 16. * In `congress-twitter-profiles.json`, print the number of verified Twitter accounts.
    echo '16.'
    cat data-hold/congress-twitter-profiles.json | \
      jq '.[] .verified' | grep 'true' | wc -l
    
    
    # 17. In `congress-twitter-profiles.json`, print the largest follower count
    echo '17.'
    cat data-hold/congress-twitter-profiles.json | \
      jq '.[] .followers_count' | sort -n | tail -n 1
    
    # 18. * In `congress-twitter-profiles.json`, print the first 10 lines of: CSV of name, screen_name, followers_count, verified, and created_at (when the account was created):
    echo '18.'
    cat data-hold/congress-twitter-profiles.json | \
      jq --raw-output '.[] | [.name, .screen_name, .followers_count, .verified, .created_at] | @csv' |\
      head -n 10
    
    # 19. In `congress-twitter-profiles.json`, print the first 10 lines of: a CSV of screen_name,
    # statuses_count, followers_count, created_at of their most recent tweet
    echo '19.'
    cat data-hold/congress-twitter-profiles.json | \
      jq --raw-output '.[] | [.screen_name, .statuses_count, .followers_count, .status .created_at] | @csv' | \
      head -n 10