Better Know a Former Congressmember with Grep

The overview to a four-part homework assignment in looking up and comparing lobbying and U.S. Congressional activity.

This page serves as the overview for the 4-part homework project on using computational techniques to automate the collection, parsing, and filtering of data related to lobbying activity and the United States Congress.

(This overview and the assignments are still under construction. Expect a due date of early March.)

The title of this project is inspired by The Colbert Report's 434-part series, Better Know a District.

Find something interesting, write a program to count it

This mini-project is meant to be both a review of programming concepts and an example of how simple (but brute-force-powered) data-filtering techniques can (and should) be applied to interesting information problems, which is basically the theme of this entire course.

The abstract goal of this project: match a list of names to another list (e.g. lobbying activity) that contains names and see if the matches contain anything interesting.

But what is "interesting"? That word means entirely different things, depending on whether you are a government official, journalist, public advocate, academic, or hedge-fund analyst.

Off the top of my head, for the scope of this assignment, a shortlist of possibly interesting things:

Former legislators who are now lobbyists
Former legislators, who are not yet of retirement age, but who are not on the lobbying list
Former Congress staffmembers who are now lobbyists
When these former legislators and staffmembers became lobbyists
What issues these former legislators and staffmembers have lobbied on
The date of lobbying activity relative to legislative activity

So what is interesting is not a computational problem. But gathering the information and filtering it is most definitely a problem for a computer to solve for us. Thanks to the hard work and advocacy of groups such as the Sunlight Foundation and the Center for Responsive Politics, as well as the many policymakers and journalists who effected change in response to scandal and controversy, we have datasets that can be relatively easy for the computer to compile, leaving us to deal with the interesting work of finding something interesting in the data.

While the programming needed to make a computer collect and filter the data is relatively trivial (you might be able to do all of it in under 50 lines of Bash), the work of collecting and filtering the data is not. As you'll soon see, one: there's a lot of data, and two: the origin and purpose of each dataset present various challenges when trying to join them for cross-referencing and analysis.

In other words, even though this data is public and easily accessible, it is not easily usable. In this project, we'll see how much we can make it usable.
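The abstract goal described earlier, matching one list of names against another list that contains names, can be sketched with grep. This is only a rough illustration: the file names, the names themselves, and the pipe-delimited layout are all made up for the example, and the real datasets will be messier.

```shell
# Hypothetical sample data: every name and file name below is made up
workdir=$(mktemp -d)

# The first list: one last name per line (e.g. former legislators)
printf '%s\n' 'DOE' 'ROE' > "$workdir/names.txt"

# The second list: records that happen to contain names (e.g. lobbyists)
printf '%s\n' \
  'DOE|JOHN|2013|Example Lobbying Firm' \
  'SMITH|JANE|2012|Another Example Firm' \
  'ROE|RICHARD|2014|Example Lobbying Firm' > "$workdir/lobbyists.psv"

# -F: treat each pattern as a fixed string, not a regular expression
# -w: match whole words only, so DOE does not match DOERR
# -f: read the patterns from a file, one per line
matches=$(grep -F -w -f "$workdir/names.txt" "$workdir/lobbyists.psv")
echo "$matches"
```

Matching on last name alone will turn up plenty of people who merely share a name, which is part of why the filtering is harder than the programming.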

The data work

While this is thematically just one project, I've broken up the data-collection parts into their own assignments, as they deal with different data domains and challenges. The split is also meant to keep you from trying to cram the entire assignment in the night before it's due (date TBA, but probably early March).

Each part of this project will have its own page (TBA). For now, the tasks are divided into:

1. Collecting post-employment and historical legislator data

Post Employment notifications via the U.S. House Office of the Clerk
Post Employment Lobbying Restrictions via the U.S. Senate
Former members of Congress via the unitedstates/congress GitHub project

2. Collecting Congressional staff data from expenditure reports

Senate expenditure reports via Sunlight Foundation, "Now it's easier to account for how the Senate spends your money"
House expenditure reports via Sunlight Foundation, "House Expenditure Reports Database"

3. Collecting lobbyist data from the public lobbying database

Downloadable lobbying databases via the U.S. Senate

Be sure to check out the web interface for the Senate lobbying database, as well as the information and guidance on the Lobbying Disclosure Act.

4. Joining the data and finding name matches

The goal of Steps 1 through 3 is to create data files that can be used in Step 4. Each data file will basically be an arrangement of fields common to all the datasets, plus fields particular to each dataset that are useful to keep track of:

last_name	first_name	date	description	another_description	something_else

For example, from the Senate expenditure reports, the parsed data fields might look like this:

DOE	JOHN N	2012	SECRETARY OF THE SENATE - LEGISLATIVE SERVICES Funding Year 2012 SALARIES, OFFICERS AND EMPLOYEES, SENATE	REPORTER OF DEBATES	50,100.25
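As a small sketch of how such a record might be picked apart, the snippet below pulls the common fields out of the sample line with cut. It assumes the fields are separated by tabs, as they appear in the sample above; if the final files end up using another delimiter, cut's -d option would need to be set accordingly. The variable names are just for illustration.

```shell
# The sample record from above, with fields separated by tabs (an assumption)
record=$(printf 'DOE\tJOHN N\t2012\tSECRETARY OF THE SENATE - LEGISLATIVE SERVICES Funding Year 2012 SALARIES, OFFICERS AND EMPLOYEES, SENATE\tREPORTER OF DEBATES\t50,100.25')

# cut -f selects fields by position; tab is cut's default delimiter
last_name=$(printf '%s\n' "$record" | cut -f 1)
first_name=$(printf '%s\n' "$record" | cut -f 2)
year=$(printf '%s\n' "$record" | cut -f 3)

echo "$last_name, $first_name ($year)"
```

Because the first three fields are the same in every parsed file, a step like this should work on any of them without caring what the dataset-specific fields at the end contain.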

We've used virtually every technique required to collect and filter the data in past assignments.

In particular, this project most resembles the challenges of the death rows assignment, in which you're gathering related data from different sources (and formats) and reconciling them.
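One way the reconciling in Step 4 might look is with the join command, which pairs up lines from two files that share a key field. The sketch below assumes pipe-delimited files keyed on last name; the file names and records are invented for the example and are not the real parsed data.

```shell
# Hypothetical sample data: every name and file name below is made up
workdir=$(mktemp -d)

# join needs both inputs sorted on the join field, using the same collation
printf '%s\n' \
  'SMITH|JANE|2011|STAFF ASSISTANT' \
  'DOE|JOHN|2012|REPORTER OF DEBATES' |
  LC_ALL=C sort -t '|' -k 1,1 > "$workdir/staff.psv"

printf '%s\n' \
  'ROE|RICHARD|2013|Example Lobbying Firm' \
  'DOE|JOHN|2014|Example Lobbying Firm' |
  LC_ALL=C sort -t '|' -k 1,1 > "$workdir/lobbyists.psv"

# -t sets the field delimiter; by default join matches on the first field
joined=$(LC_ALL=C join -t '|' "$workdir/staff.psv" "$workdir/lobbyists.psv")
echo "$joined"
```

Only last names that appear in both files survive the join. The output line starts with the shared last name, followed by the remaining fields from the first file and then those from the second, so the two dates end up side by side for comparison.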

Getting better at knowing

One of my favorite segments from The Colbert Report is Better Know a District, in which Stephen Colbert does his part to introduce America to our many representatives, letting us know their accomplishments and other important issues, such as their grasp of the Ten Commandments and whether cocaine is fun.

Despite the inherent risk to legislators, the "434-part" series managed to air more than 80 segments. But that just underscores how little we know about each of our sitting legislators, and how many of them there are.

And we know even less about past Congressmembers. Between 2005, when "Better Know a District" first aired, and 2014, about 350 House members and 80 Senators left Congress. Some of them retired and others found other work to do. But there's no LinkedIn for former members of Congress. That isn't surprising: they are no longer being paid by taxpayers to act on our behalf, so there is less interest in how former Congressmembers choose to spend their time.

However, what if they spend their time on things that impact the American public in interesting ways? And what if that work is based on, or helped by, the work they did while under the taxpayers' employ? Well, then this becomes an interesting data problem.

More notes to come…

Project tree

This is probably what the project folder structure will look like:

compciv/
|___homework/
    |___congress-lobbying/
        |___expenditures/
            |___helper.sh
            |___parser.sh
            |___parsed_house_expenditures.psv
            |___parsed_senate_expenditures.psv
            |___data-hold/
                |___senate/
                |___house/
        |___post_employment/
            |___helper.sh
            |___parser.sh
            |___parsed_historical_congress_legislators.psv
            |___parsed_post_senate_employment.psv
            |___parsed_post_house_employment.psv
            |___data-hold/
                |___senate/
                |___house/
        |___public_filings/
            |___helper.sh
            |___parser.sh
            |___parsed_lobbying_filings.psv
            |___data-hold/
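If it helps to start from an empty skeleton, something like the following would create the folders and placeholder scripts shown in the tree. This is just a sketch: it builds everything inside a temporary directory (the root variable is an assumption for the example) so that nothing in an existing compciv folder gets touched, and the parsed .psv files are left out because the parser scripts are meant to produce them.

```shell
# Build the skeleton in a temporary directory so nothing existing is touched
root=$(mktemp -d)
base="$root/compciv/homework/congress-lobbying"

# -p creates any missing parent directories and does not complain if they exist
mkdir -p "$base/expenditures/data-hold/senate" \
         "$base/expenditures/data-hold/house" \
         "$base/post_employment/data-hold/senate" \
         "$base/post_employment/data-hold/house" \
         "$base/public_filings/data-hold"

# Create empty placeholder scripts in each of the three subfolders
for part in expenditures post_employment public_filings; do
  touch "$base/$part/helper.sh" "$base/$part/parser.sh"
done

# List the directories that were created
find "$base" -type d | sort
```

To build the structure somewhere permanent, root could be pointed at an existing directory instead of the output of mktemp.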

More notes to come…