Computational Methods in the Civic Sphere

A winter elective on programming and journalism for the Stanford Computational Journalism Lab

COMM 113/213 - Winter 2015

Monday and Wednesday, 2:15 to 3:45PM
Building 200, Room 303 [CourseExplorer link]

Instructor: Dan Nguyen | @dancow | dun @ stanford
Office hours: Tuesdays and Thursdays, 2 to 4PM, or by appointment.
Piazza: The class discussion board is hosted on Piazza. Feel free to ask questions and collaborate there.


Agenda for March 8, 2015 - More incomplete notes.

Agenda for March 1, 2015 - Intro to machine learning and Bayesian fun.

Agenda for Feb. 25, 2015

Agenda for Feb. 16, 2015

Homework

Collecting Dallas Officer-Involved Shootings
Collect and parse the Dallas Police Department's officer-involved shooting data and make an interactive map.
Due: Wednesday, March 11 | Points: 20

404-Finder
Write a program to auto-detect broken links.
Due: Tuesday, March 10 | Points: 5

The Celebrity (Tw)It List
Finding out who the most-followed users follow on Twitter.
Due: Tuesday, March 10 | Points: 5

Draft proposal of a final project
Use your computational methods to solve a computational problem of your own choosing.
Due: Tuesday, February 24 | Points: 1

Build face-grep in Python
Taking the Unix philosophy to Python and computer vision object-detection algorithms.
Due: Friday, February 20 | Points: 5

Listing the BuzzFeed listicles
Practicing web-scraping and regexes on BuzzFeed listicle titles.
Due: Tuesday, February 17 | Points: 5

Analyzing Tweets in CSV form
Connect to the Twitter API, download a user's tweets as CSV, and count the frequency of hashtags and words.
Due: Friday, February 13 | Points: 5

Firsts in American baby-naming
Even more practice with text filters, this time to find when baby names first became known.
Due: Tuesday, February 10 | Points: 3

Collecting and analyzing job listings from the USAJobs.gov API
Ask what you can do for your country, and what your country can pay you.
Due: Friday, February 6 | Points: 10

Using baby names to classify names by gender
Use the SSA baby name data to make a naive filter for guessing the gender of a name.
Due: Tuesday, February 3 | Points: 5

Death Row rows parsing
Collect and aggregate data from three different states' death row listings.
Due: Friday, January 30 | Points: 10

Basic if-else practice
Practice the logic of if-elif-else conditional branching.
Due: Friday, January 30 | Points: 5

Exploring Congressional Twitter data as JSON
Basic JSON parsing exercise using what Congress tweets.
Due: Tuesday, January 27 | Points: 5

More analysis of trends in American baby-naming
More practice with text filters to find interesting trends in the SSA baby name data.
Due: Tuesday, January 27 | Points: 5

Parsing the White House Press Briefings as HTML
Data analysis of all the words used in the White House press briefings.
Due: Thursday, January 22 | Points: 10

Managing baby names and data projects with Github
A sampler project that demonstrates how your code and data should be organized for minimal head-smashing.
Due: Friday, January 16 | Points: 5

Basic word analysis of the White House Press Briefings
After collecting the list of WH briefings, it's time to get each briefing.
Due: Friday, January 16 | Points: 10

Collecting the White House Press Briefings
The first step in analyzing web data is to just collect the webpages.
Due: Wednesday, January 14 | Points: 10

Setup Prep
Setting up our programming toolbox and environment.
Due: Wednesday, January 7 | Points: 10

Computational Methods in the Civic Sphere (COMM 113/213) examines why some information problems are computational – and why others are not – in the context of journalistic enterprise and its wide variety of information problems: research, data collection, data cleaning, statistical analysis, information design, information retrieval, verification, publication, and mass distribution.

We will study real-world problems in journalism and data science, and we will also attempt to solve them. We will study programming, because many of these problems can be substantially solved through programming. But we will also learn why not every problem can or should be fitted to a mechanical algorithm.

Students who successfully complete this class will inevitably learn a wide array of tools and techniques. But gaining a useful skillset is only a coincidental outcome. Our main goal is to learn how to think, and to understand how a computer can complement, but not replace, our ability to make decisions.

Tentative schedule

  • Week 1: Monday, January 5
    Introduction to CompCiv, computational problems and the Unix Way
    The philosophy of the course and a boot camp on working with Unix-like systems. We'll set up our working environment on Stanford's shared computing and learn how to do things completely from the command-line, while picking up some programming fundamentals along the way.
  • Week 2: Monday, January 12
    Text processing and exploring APIs
    Since text is such a fundamental medium for computing, learning how to process text will be one of our most vital skills, especially when working with data. We'll learn a mini-language, regular expressions, to find text by patterns. We'll also examine how complex data is serialized into plaintext formats such as JSON. These plaintext data files are a nearly universal format for modern APIs such as Twitter.
  • Week 3: Monday, January 19
    Reading and researching APIs
    Expanding on our ability to work with textual data, we'll learn to parse the JSON data format and to programmatically search and retrieve data from a variety of online data sources.
  • Week 4: Monday, January 26
    APIs, Web scraping, and web publishing
    The data on web sites are just another form of structured text (and APIs, when available, are a formalized, cleaner way of getting structured text). The process of scraping involves just another text pattern to understand and take apart, with some additional thinking about how we design programs to be efficient and less prone to failure (such as when the Internet goes down). Since most complex web sites are generated by programs, we'll be able to produce HTML as well as consume it.
  • Week 5: Monday, February 2
    More HTML and data visualization and analysis
    Plaintext output of data columns has a limited range of expressiveness, so we'll learn how to include visualization libraries in our programming pipelines. Originally, this week's description talked about databases. We'll be working with large datasets, but we'll stick to parsing JSON and text files and producing web pages, rather than trying to learn a SQL syntax on top of what we know.
  • Week 6: Monday, February 9
    Processing documents and multimedia
    After staring (mostly) at text for the past few weeks, this will be a shift in scenery, as we do some hands-on work with the binary formats of images and other multimedia formats. Some of the work will involve turning imagery into text, as in the case of scanned documents and optical character recognition. Other processes involve programmatic manipulation and transformation of multimedia. And other concepts, such as face recognition, work as a segue into the statistical classification methods that we'll see in the remaining weeks.
  • Week 7: Monday, February 16
    Unstructured data and natural language processing
    With the Python Natural Language Toolkit, we take a deeper look at the strategies and concepts needed to work with data -- in this case, human language -- that doesn't have the structure and conveniences of text that comes from a database or API.
  • Week 8: Monday, February 23
    Introduction to machine learning
    Using Python's scikit-learn library, we'll be able to test the effectiveness (and speed) of different machine learning algorithms, as well as how quality and size of training datasets affect the results of machine learning processes.
  • Week 9: Monday, March 2
    Applications of machine learning
    Continuing on the previous week's lessons, we'll try to test the practical effectiveness of unsupervised machine learning on real world data problems, as well as think through how human critical thinking and analysis can be best augmented by algorithmic processes.
  • Week 10: Monday, March 9
    TBA
    A spillover week. Since we lose two lecture periods to federal holidays, the overall schedule of topics may be pushed into the final week. If not, maybe we'll just have a final.
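
As a small taste of the Week 2 and Week 3 material, here is a minimal sketch of deserializing JSON and pattern-matching text with a regular expression in Python. The tweet record below is invented for illustration; it is not real Twitter API output:

```python
import json
import re

# A hypothetical tweet record, shaped loosely like the Twitter API's JSON output
raw = '{"text": "Big news from #Stanford today: #journalism meets #code", "retweet_count": 12}'

tweet = json.loads(raw)  # deserialize the plaintext JSON into a Python dict
hashtags = re.findall(r'#\w+', tweet['text'])  # regex: '#' followed by word characters
print(hashtags)  # → ['#Stanford', '#journalism', '#code']
```

The same two moves — parse the serialized text, then filter it by pattern — cover a surprising share of the data work we'll do this quarter.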
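
The scraping described in Week 4 can be sketched with Python's standard-library html.parser. The page snippet below is invented for illustration; in real work you'd first fetch pages with curl or urllib:

```python
from html.parser import HTMLParser

# Collect every link URL (href attribute of <a> tags) from an HTML page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# A made-up page, standing in for a listing of press-briefing links
page = '<html><body><a href="/briefings/1">Jan 5</a> <a href="/briefings/2">Jan 7</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # → ['/briefings/1', '/briefings/2']
```

The structure of HTML is just another text pattern: once the links are extracted, each one becomes another page to fetch and parse.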
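
To preview the classification ideas of Weeks 8 and 9 (and the baby-name gender homework), here is a toy sketch in plain Python rather than scikit-learn: a naive classifier built on a single feature, the last letter of a name. The six training names are invented and far too few for real use:

```python
from collections import Counter

# Toy training data, invented for illustration; the real exercise uses SSA baby name files.
training = [('Mary', 'F'), ('Emma', 'F'), ('Sophia', 'F'),
            ('John', 'M'), ('Liam', 'M'), ('Noah', 'M')]

# Count how often each final letter appears per gender -- a crude one-feature model.
counts = {'F': Counter(), 'M': Counter()}
for name, gender in training:
    counts[gender][name[-1].lower()] += 1

def guess_gender(name):
    """Pick whichever gender saw this name's final letter more often in training."""
    letter = name[-1].lower()
    return max(counts, key=lambda g: counts[g][letter])

print(guess_gender('Anna'))  # → 'F' (the 'a' ending appears only among F names above)
```

The interesting questions are the ones the sketch dodges: how accurate is this, on what test data, and how does the quality and size of the training set change the answer?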

Grading

Homework 60%
Midterm 20%
Attendance and pop quizzes 20%
Extra Credit 10 to 20%

This class heavily emphasizes problem solving. Homework assignments are often structured as mini-projects, but with definite right and wrong answers.

Catching up

Extra credit projects will be generated on a regular basis.

Attendance policy

Much of this classwork is based on the concept of flexibility and abstraction, including the ability to confidently write programs that run on their own, independently of our interaction, and on a variety of machines, from our personal laptops to cloud servers. And virtually every topic I cover in lecture will be posted online.

The obvious question you should have is: why even show up for class?

For camaraderie, perhaps, or relief from spending too much time in front of an electronic screen. But also to discuss the concepts, bounce around ideas, and get feedback on new projects or different avenues of exploration. This is more easily done in a group, face-to-face, so be prepared to show up as if this were any traditional lecture.

Required textbook

The required textbook is Data Science at the Command Line by Jeroen Janssens, published in 2014. It works both as a handy technical reference and as a book full of interesting data science explorations.

You can purchase the book at O'Reilly: $34 for the ebook version, or $44 for both the ebook and print version. The book is also available on Amazon.

Inspirational projects

These are a few journalism-related projects that I have in mind when I think about the use of computational problem solving. I'm hoping that students will not only be able to understand why these projects were conceived and how they work, but also implement them in part.

(Note: This is not at all a complete list of worthwhile journalism projects, but a partial list of projects that can be more easily dissected and studied.)

Additional prep notes

If you are taking this class, or are just following along, here are some setup steps for our work environment: