Draft proposal of a final project
Even if your programming skills aren’t very advanced, you basically know how to apply a computer’s tireless, uncomplaining brute force to the two tasks behind most information problems: collecting information and filtering it.
You have about two weeks to come up with a final project idea, which basically amounts to: solve an interesting information problem. The scope of the project need not be much different from the homeworks, except that you do the research, you come up with the idea for a program, and you write the program yourself.
The program you write does not have to solve the problem on its own, as any problem worth solving requires many programs (and human insight) to solve entirely. But the program must tackle a facet of the work that is critical to solving or understanding the problem, yet too menial and repetitive for a human to do, especially if doing it by hand would take dozens or hundreds of hours.
Deliverables
A folder in homework/final-project
This is where your code and other related material for the project will reside. For now, it just needs a copy of the draft proposal.
|-compciv/
  |-homework/
    |--final-project/
      |--draft-proposal.md
A 300+ word proposal in draft-proposal.md
Write a draft proposal of 300 or more words in a plaintext file named draft-proposal.md. Include:
- What your proposed project is about
- The data sources you anticipate accessing
- The general workings of the program you anticipate writing
If you have a partner for this project, each of you must include your own copy of the final-project folder and the draft-proposal.md file.
General requirements for the final project
The four-part "Better Know a Former Congressmember With Grep" homework project is a decent example for the scope of research, programming, and goals for your final project.
- You can work in pairs.
- It should involve programmatically combining (and/or filtering) the results of one data source with at least one other data source.
- The scope can be similar to that of Collecting Dallas Officer-Involved Shootings.
- The biggest allocation of your time should go into researching the problem domain (i.e. where does the data come from, how to access it, what are its known problems).
- The second-biggest timesuck of this project should probably go into writing the program, and this includes looking for programs and techniques.
- There must be an automatable component – i.e. what we typically consider a "program" – that I can run, on my own computer, as is, in a single command, and reproduce your results.
- It doesn't matter if your automatable program takes 5 minutes or 5 hours to fully execute. It should perform (again, on its own, without any interaction from the user) some kind of grunt work task that would take you dozens/hundreds of hours to do by hand.
- The structure of the program is left up to you. You can write it in whatever language you want, as long as it can run on corn.stanford.edu. It can be a program that calls other programs, for instance. In fact, I highly recommend that your project contain a few different program files, just as all of our homeworks have consisted of separate files for separate programs and functionality. If your project can be done in a single script of 20 lines or fewer, it may be too easy a project.
- You'll probably have a manual component to the project, in which you take the results of your program and do something with them. Maybe you clean the data of problems that require human judgment, or build a nice-looking web interface to your results. Your automatable program isn't meant to solve the problem on its own, but to delegate
- The problem you solve must be somewhat related to some kind of civic or public affairs issue. Scraping Craigslist to efficiently find yourself a new couch does not fall into that category. But you don't have to save democracy or anything like that.
Note: Don't let not knowing something as a programmer – such as how to set up a program to run every hour on the hour to continuously monitor something – be the barrier. Part of the value of this project is just being able to judge what is easy to give to the computer, what requires some research on your part, and what should be left entirely to you, the human, to fix up. You may have been writing some horrible, slow code, but hopefully you've gotten an idea of what should and shouldn't be done by computers.
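As a rough illustration of the requirements above (combine one data source with another, run from a single command with no user interaction, and leave the judgment calls to a human), here is a minimal Python sketch. The two CSV snippets, the city names, and every number in them are invented placeholders rather than real data, and the function names are hypothetical. A real project would fetch its sources from the web or an API instead of embedding them.

```python
import csv
import io

# Invented sample data standing in for two real sources.
# None of these cities or numbers come from an actual dataset.
INCIDENTS_CSV = """city,year,incidents
Springfield,2014,12
Shelbyville,2014,3
Ogdenville,2014,7
"""

POPULATION_CSV = """city,population
Springfield,30000
Shelbyville,15000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def combine(incidents, populations):
    """Join incident rows to population rows on the 'city' column.

    Rows with no population match are set aside instead of being
    silently dropped, since those are the ones a human has to check.
    """
    pop_by_city = {row["city"]: int(row["population"]) for row in populations}
    matched, unmatched = [], []
    for row in incidents:
        pop = pop_by_city.get(row["city"])
        if pop is None:
            unmatched.append(row["city"])
            continue
        rate = int(row["incidents"]) / pop * 10000
        matched.append({"city": row["city"], "per_10k": round(rate, 2)})
    return matched, unmatched


def main():
    matched, unmatched = combine(read_rows(INCIDENTS_CSV), read_rows(POPULATION_CSV))
    # Highest rate first, so the most interesting rows are at the top.
    for row in sorted(matched, key=lambda r: r["per_10k"], reverse=True):
        print(row["city"], row["per_10k"])
    print("needs manual review:", ", ".join(unmatched))


if __name__ == "__main__":
    main()
```

The unmatched list is the hand-off point between the automated part and the manual part: the program does the joining and arithmetic, and the human decides what to do about names that do not line up.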
Examples
Here are concepts I like that you could attempt a limited variation of for a project. Don't get wrapped up in the details, such as whether the result should be a website or an auto-tweeting bot. Instead, focus on how the program finds and filters data to bring you something interesting, and something that would be burdensome or impossible to do the old-fashioned way, by clicking through a website interface or spreadsheet.
- Twitter Bot Detector - Devise a simple algorithm (think of the Apgar score) that can be used as a kind of first-pass test for filtering a list of users – and other data from Twitter's API – and guessing the probability that each one is a bot. Check out a more advanced version by Bot or Not?.
- The L.A. Times QuakeBot - a script written by Ken Schwencke to check the USGS Earthquake API, fill in the blanks of a template story, and then notify him by email.
- The data component of Reuters' Water's Edge project - the data behind their investigative feature came from a script used to collect readings from the NOAA Tides and Currents API.
- NewsDiff - tracking changes to online news articles over time.
- SCOTUS-SERVO - tracking changes in Supreme Court opinions
- Perma.cc - preventing link-rot in legal and academic citations
- Tracking gender of reporters by byline - You can take the baby-names-based gender detector you worked on and combine it with data from web-scraping a news site. Check out the Who Writes For project to see the concept as a website.
- NYTAnon-like Bot - You know how to scrape from a news site (or use its API, if available), and you know how to grep. Try making a variation of @NYTAnon, but for another news site, or for a wider variety of phrases that indicate the presence of an anonymous source.
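To make the last idea concrete, the filtering step of an anonymous-source finder might look something like the Python sketch below. It is only a grep-like first pass: the phrase list is a hypothetical starting point rather than the list any existing bot uses, the sentence splitter is deliberately naive, and the sample article text is invented. Scraping the articles themselves is left out.

```python
import re

# Hypothetical phrase list; a real project would research and expand it.
ANON_PATTERNS = [
    r"spoke on (?:the )?condition of anonymity",
    r"requested anonymity",
    r"sources? familiar with",
    r"who (?:asked|declined) (?:not )?to be (?:named|identified)",
]
ANON_RE = re.compile("|".join(ANON_PATTERNS), re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_anonymous_sourcing(text):
    """Return the sentences that match any of the anonymity phrases."""
    return [s for s in split_sentences(text) if ANON_RE.search(s)]


# Invented article text, used only to exercise the function.
SAMPLE = (
    "The council voted on Tuesday. "
    "An official, who spoke on condition of anonymity, said the vote was rushed. "
    "The mayor declined to comment. "
    "A source familiar with the budget said cuts are likely."
)

if __name__ == "__main__":
    for sentence in find_anonymous_sourcing(SAMPLE):
        print(sentence)
```

Working at the sentence level, rather than printing whole articles, keeps the output short enough for a human to review, which is where the judgment about false positives belongs.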