Computational Methods in the Civic Sphere

A winter elective on programming and journalism for the Stanford Computational Journalism Lab

COMM 113/213 - Winter 2015

Monday and Wednesday, 2:15 to 3:45PM
Building 200, Room 303 [CourseExplorer link]

Instructor: Dan Nguyen | @dancow | dun @ stanford
Office hours: Tuesdays and Thursdays, 2 to 4PM, or by appointment.
Piazza: The class discussion board is hosted on Piazza. Feel free to ask questions and collaborate there.


Agenda for March 8, 2015 - More incomplete notes.

Agenda for March 1, 2015 - Intro to machine learning and Bayesian fun.

Agenda for Feb. 25, 2015

Agenda for Feb. 16, 2015

Homework

Collecting Dallas Officer-Involved Shootings
Collect and parse the Dallas Police Department's officer-involved shooting data and make an interactive map.
Due: Wednesday, March 11 | Points: 20

404-Finder
Write a program to auto-detect broken links.
Due: Tuesday, March 10 | Points: 5

The Celebrity (Tw)It List
Finding out who the most-followed users follow on Twitter.
Due: Tuesday, March 10 | Points: 5

Draft proposal of a final project
Use your computational methods to solve a computational problem of your own choosing.
Due: Tuesday, February 24 | Points: 1

Build face-grep in Python
Taking the Unix philosophy to Python and computer vision object-detection algorithms.
Due: Friday, February 20 | Points: 5

Listing the BuzzFeed listicles
Practicing web-scraping and regexes on BuzzFeed listicle titles.
Due: Tuesday, February 17 | Points: 5

Analyzing Tweets in CSV form
Connect to the Twitter API, download a user's tweets as CSV, and count the frequency of hashtags and words.
Due: Friday, February 13 | Points: 5

Firsts in American baby-naming
Even more practice with text filters, this time to find when baby names first became known.
Due: Tuesday, February 10 | Points: 3

Collecting and analyzing job listings from the USAJobs.gov API
Ask what you can do for your country, and what your country can pay you.
Due: Friday, February 6 | Points: 10

Using baby names to classify names by gender
Use the SSA baby name data to make a naive filter for guessing the gender of a name.
Due: Tuesday, February 3 | Points: 5

Death Row rows parsing
Collect and aggregate data from three different states' death row listings.
Due: Friday, January 30 | Points: 10

Basic if-else practice
Practice the logic of if-elif-else conditional branching.
Due: Friday, January 30 | Points: 5

Exploring Congressional Twitter data as JSON
Basic JSON parsing exercise using what Congress tweets.
Due: Tuesday, January 27 | Points: 5

More analysis of trends in American baby-naming
More practice with text filters to find interesting trends in the SSA baby name data.
Due: Tuesday, January 27 | Points: 5

Parsing the White House Press Briefings as HTML
Data analysis of all the words used in the White House press briefings.
Due: Thursday, January 22 | Points: 10

Managing baby names and data projects with Github
A sampler project that demonstrates how your code and data should be organized for minimal head-smashing.
Due: Friday, January 16 | Points: 5

Basic word analysis of the White House Press Briefings
After collecting the list of WH briefings, it's time to get each briefing.
Due: Friday, January 16 | Points: 10

Collecting the White House Press Briefings
The first step in analyzing web data is to just collect the webpages.
Due: Wednesday, January 14 | Points: 10

Setup Prep
Setting up our programming toolbox and environment.
Due: Wednesday, January 7 | Points: 10

Computational Methods in the Civic Sphere (COMM 113/213) examines why some information problems are computational – and why others are not – in the context of journalistic enterprise and its wide variety of information problems: research, data collection, data cleaning, statistical analysis, information design, information retrieval, verification, publication, and mass distribution.

We will study real-world problems in journalism and data science, and we will also attempt to solve them. We will study programming, because many of these problems can be substantially solved through programming. But we will also learn why not every problem can or should be fitted to a mechanical algorithm.

Students who successfully complete this class will inevitably learn a wide array of tools and techniques. But gaining a useful skillset is only a coincidental outcome. Our main goal is to learn how to think, and to understand how a computer can complement, but not replace, our ability to make decisions.

Tentative schedule

  • Week 1: Monday, January 5
    Introduction to CompCiv, computational problems and the Unix Way
    The philosophy of the course and a boot camp on working with Unix-like systems. We'll set up our working environment on Stanford's shared computing and learn how to do things completely from the command-line, while picking up some programming fundamentals along the way.
  • Week 2: Monday, January 12
    Text processing and exploring APIs
    Since text is such a fundamental medium for computing, learning how to process text will be one of our most vital skills, especially when working with data. We'll learn a mini-language, regular expressions, to find text by patterns. We'll also examine how complex data is serialized into plaintext formats such as JSON. These plaintext data files are a nearly universal format for modern APIs such as Twitter.
  • Week 3: Monday, January 19
    Reading and researching APIs
    Expanding on our ability to work with textual data, we'll learn to parse the JSON data format and to programmatically search and retrieve data from a variety of online data sources.
  • Week 4: Monday, January 26
    APIs, Web scraping, and web publishing
    The data on web sites are just another form of structured text (and APIs, when available, are a formalized, cleaner way of getting structured text). The process of scraping involves just another text pattern to understand and take apart, with some additional thinking about how we design programs to be efficient and less prone to failure (such as when the Internet goes down). Since most complex web sites are generated by programs, we'll be able to produce HTML as well as consume it.
  • Week 5: Monday, February 2
    More HTML and data visualization and analysis
    Plaintext output of data columns has a limited range of expressiveness, so we'll learn how to include visualization libraries in our programming pipelines. Originally, this week's description talked about databases. We'll be working with large datasets, but we'll stick to parsing JSON and text files and producing web pages, rather than trying to learn a SQL syntax on top of what we know.
  • Week 6: Monday, February 9
    Processing documents and multimedia
    After staring (mostly) at text for the past few weeks, this will be a shift in scenery, as we do some hands-on work with the binary formats of images and other multimedia formats. Some of the work will involve turning imagery into text, as in the case of scanned documents and optical character recognition. Other processes involve programmatic manipulation and transformation of multimedia. And other concepts, such as face recognition, work as a segue into the statistical classification methods that we'll see in the remaining weeks.
  • Week 7: Monday, February 16
    Unstructured data and natural language processing
    With the Python Natural Language Toolkit, we take a deeper look at the strategies and concepts needed to work with data -- in this case, human language -- that doesn't have the structure and conveniences of text that comes from a database or API.
  • Week 8: Monday, February 23
    Introduction to machine learning
    Using Python's scikit-learn library, we'll be able to test the effectiveness (and speed) of different machine learning algorithms, as well as how quality and size of training datasets affect the results of machine learning processes.
  • Week 9: Monday, March 2
    Applications of machine learning
    Continuing on the previous week's lessons, we'll try to test the practical effectiveness of unsupervised machine learning on real world data problems, as well as think through how human critical thinking and analysis can be best augmented by algorithmic processes.
  • Week 10: Monday, March 9
    TBA
    A spillover week. Since we lose two lecture periods to federal holidays, the overall schedule of topics may be pushed into the final week. If not, maybe we'll just have a final.
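
As a small taste of the Week 2 and Week 3 material, here is a minimal sketch of deserializing JSON and pattern-matching text with a regular expression in Python. The tweet record below is invented for illustration; it is not real Twitter API output:

```python
import json
import re

# A hypothetical tweet record, shaped loosely like the Twitter API's JSON output
raw = '{"text": "Big news from #Stanford today: #journalism meets #code", "retweet_count": 12}'

tweet = json.loads(raw)  # deserialize the plaintext JSON into a Python dict
hashtags = re.findall(r'#\w+', tweet['text'])  # regex: '#' followed by word characters
print(hashtags)  # → ['#Stanford', '#journalism', '#code']
```

The same two moves — parse the serialized text, then filter it by pattern — cover a surprising share of the data work we'll do this quarter.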
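
The scraping described in Week 4 can be sketched with Python's standard-library html.parser. The page snippet below is invented for illustration; in real work you'd first fetch pages with curl or urllib:

```python
from html.parser import HTMLParser

# Collect every link URL (href attribute of <a> tags) from an HTML page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# A made-up page, standing in for a listing of press-briefing links
page = '<html><body><a href="/briefings/1">Jan 5</a> <a href="/briefings/2">Jan 7</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # → ['/briefings/1', '/briefings/2']
```

The structure of HTML is just another text pattern: once the links are extracted, each one becomes another page to fetch and parse.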
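
To preview the classification ideas of Weeks 8 and 9 (and the baby-name gender homework), here is a toy sketch in plain Python rather than scikit-learn: a naive classifier built on a single feature, the last letter of a name. The six training names are invented and far too few for real use:

```python
from collections import Counter

# Toy training data, invented for illustration; the real exercise uses SSA baby name files.
training = [('Mary', 'F'), ('Emma', 'F'), ('Sophia', 'F'),
            ('John', 'M'), ('Liam', 'M'), ('Noah', 'M')]

# Count how often each final letter appears per gender -- a crude one-feature model.
counts = {'F': Counter(), 'M': Counter()}
for name, gender in training:
    counts[gender][name[-1].lower()] += 1

def guess_gender(name):
    """Pick whichever gender saw this name's final letter more often in training."""
    letter = name[-1].lower()
    return max(counts, key=lambda g: counts[g][letter])

print(guess_gender('Anna'))  # → 'F' (the 'a' ending appears only among F names above)
```

The interesting questions are the ones the sketch dodges: how accurate is this, on what test data, and how does the quality and size of the training set change the answer?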

Grading

Homework 60%
Midterm 20%
Attendance and pop quizzes 20%
Extra Credit 10 to 20%

This class heavily emphasizes problem solving. Homework assignments are often structured as mini-projects, but with definite right and wrong answers.

Catching up

Extra credit projects will be generated on a regular basis.

Attendance policy

Much of this classwork is based on the concept of flexibility and abstraction, including the ability to confidently write programs that run on their own, independently of our interaction, and on a variety of machines, from our personal laptops to cloud servers. And virtually every topic I cover in lecture will be posted online.

The obvious question you should have is: why even show up for class?

For camaraderie, perhaps, or relief from spending too much time in front of an electronic screen. But also to discuss the concepts, bounce around ideas, and get feedback on new projects or different avenues of exploration. This is more easily done in a group, face-to-face, so be prepared to show up as if this were any traditional lecture.

Required textbook

The required textbook is Data Science at the Command Line by Jeroen Janssens, published in 2014. It works both as a handy technical reference and as a book full of interesting data science explorations.

You can purchase the book at O'Reilly: $34 for the ebook version, or $44 for both the ebook and print version. The book is also available on Amazon.

Inspirational projects

These are a few journalism-related projects that I have in mind when I think about the use of computational problem solving. I'm hoping that students will not only be able to understand why these projects were conceived and how they work, but also implement them in part.

(Note: This is not at all a complete list of worthwhile journalism projects, but a partial list of projects that can be more easily dissected and studied.)

Additional prep notes

If you are taking this class, or are just following along, here are some setup steps for our work environment: