Week 8

Lesson plan:


The best ideas come from travel



# http://www.compciv.org/recipes/data/collecting-your-instagram-media-data/
cat ~/data/aaronschock-instagram/*.json | jq '.data[] .caption'

Download Sunlight Foundation data

mkdir -p /tmp/sunlight
cd /tmp/sunlight
curl -s http://sunlightfoundation.com/tools/expenditures/ | pup 'a attr{href}' | grep 'csv' | xargs wget

# look for travel
cat *.csv | grep SCHOCK | grep TRAV



How Liberal/Conservative is a given Twitter account?


Find a labeled set

People you know who are Republican/Democrat

Find a feature

What makes someone a liberal or a conservative? To gauge that you need to do research, or read what that person says.

Sometimes, you look at what other people say about that person. If a lot of people whom you know to be Republican say they like someone, then that someone is probably a Republican too.

How to quantify this for a computer?

Given a list of Republican and Democratic accounts, see whom they follow. If more Republicans than Democrats follow a given account, then that account may be considered a "Republican" account.
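A minimal Python sketch of that counting idea. The handles, parties, and friend lists below are made up for illustration, and the tie rule (no label when the counts are equal) is an assumption of this sketch, not part of the lesson:

```python
from collections import Counter

# Hypothetical labeled set: legislator handle -> party.
party_of = {"rep_a": "R", "rep_b": "R", "dem_a": "D"}

# Hypothetical friend lists: legislator handle -> accounts that legislator follows.
friends_of = {
    "rep_a": ["acct_1", "acct_2"],
    "rep_b": ["acct_1"],
    "dem_a": ["acct_2", "acct_3"],
}

def count_followers_by_party(party_of, friends_of):
    """Return {account: Counter of labeled followers per party}."""
    counts = {}
    for legislator, friends in friends_of.items():
        party = party_of[legislator]
        for account in friends:
            counts.setdefault(account, Counter())[party] += 1
    return counts

def guess_party(counter):
    """Label an account by whichever party follows it more; None on a tie."""
    if counter["R"] == counter["D"]:
        return None
    return "R" if counter["R"] > counter["D"] else "D"

counts = count_followers_by_party(party_of, friends_of)
for account in sorted(counts):
    print(account, dict(counts[account]), guess_party(counts[account]))
```

The shell pipeline later in these notes does the same thing with files: one friend list per legislator, counted per party.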


Programming the political-tracking

This is based on the guide "Find out who Congress follows on Twitter using the command line".

Get the legislators
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Filter for current legislators
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv

Get all the friends of any given legislator

next_cursor=-1  # -1 asks the API for the first page of results
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    echo "errors: $(echo "$json" | jq '.errors[0] .message')" >&2
    break  # just exit if there's an error
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done

A folder full of friends

Let's assume we have a directory full of friend_id files:


To find total number of followed accounts:


cat $data_dir/*.txt | wc -l

Who follows you? First, get your ID:

id=$(t whois --csv dancow | csvfix order -smq -f 1 | tail -n 1)
# -x matches whole lines only, so an ID can't match inside a longer ID
cat $data_dir/*.txt | grep -x "$id" | wc -l
grep -lx "$id" $data_dir/*.txt

Get all accounts followed by Democrats, then by Republicans

# [A-Za-z] instead of [A-z], which also matches punctuation such as [ \ ] ^ _ `
grep -f <(csvfix find -f 7 -s D < current-legislators.csv |
  csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls $data_dir/*.txt) |
  xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv

grep -f <(csvfix find -f 7 -s R < current-legislators.csv |
  csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls $data_dir/*.txt) |
  xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv

Create a spreadsheet that joins the two

echo "ID,democrat_friends,republican_friends" > /tmp/friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv |
  csvfix find -smq -if '($2 + $3) > 10' >> /tmp/friends_by_party.csv

Get the friends of your own account, saved to a file

next_cursor=-1  # -1 asks the API for the first page of results
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    echo "errors: $(echo "$json" | jq '.errors[0] .message')" >&2
    break  # just exit if there's an error
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done > $myfile1

#### Find my friends

echo "ID,democrats,republicans" > $myfile2
csvfix join -f 1:1 $myfile1 /tmp/friends_by_party.csv >> $myfile2

open -a "/Applications/Microsoft Office 2011/Microsoft Excel.app" "$myfile2"

Note that there are a lot of things we don't care about.

Our model is very fast. It's fairly easy to explain. But we sacrifice nuance and complexity.

Sentiment calculator

http://www.sentiment140.com/ http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/

Examples of sentiment analysis in code


Nice slides: http://vumaasha.github.io/pychennai-sentiment-analysis/#/imagination

Bayesian theory

An intuitive formula for updating the probability of an outcome based on past evidence. The key question: how good is the past evidence?

Extremely long explanation

Explanation excerpt

Explanation with legos

Cancer example

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

via Yudkowsky
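The cancer question can be worked through with Bayes' theorem using only the three numbers given. A short sketch (the function and parameter names are made up for illustration):

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    positive_and_sick = prior * true_positive_rate
    positive_and_healthy = (1 - prior) * false_positive_rate
    return positive_and_sick / (positive_and_sick + positive_and_healthy)

# 1% prevalence, 80% of sick women test positive, 9.6% of healthy women test positive
p = posterior(prior=0.01, true_positive_rate=0.80, false_positive_rate=0.096)
print(f"{p:.1%}")  # about 7.8%
```

The answer is low because healthy women vastly outnumber sick ones, so most positive results are false positives: 0.008 / (0.008 + 0.09504).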

Bayesian in application

The problem is that not all things can be as easily quantified as incidence of cancer and results of cancer screening. With sentiment analysis, you have to decide which things truly indicate positive/negative, and you don't know how many such factors there may be.
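To make this concrete, here is a rough sketch of a word-based Naive Bayes sentiment guesser in plain Python. The four training sentences are invented and far too few for real use. The choice of single words as features and add-one smoothing are assumptions of this sketch, not the only way to do it:

```python
import math
from collections import Counter

# Hypothetical, hand-labeled training set.
training = [
    ("i love this great movie", "pos"),
    ("what a great happy day", "pos"),
    ("i hate this awful movie", "neg"),
    ("what an awful sad day", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["pos"]) | set(word_counts["neg"])

def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score

def classify(text):
    """Pick the label with the higher score."""
    return max(("pos", "neg"), key=lambda label: log_score(text, label))

print(classify("a great day"))
print(classify("this movie is awful"))
```

Every word counts as evidence here, which is exactly the problem described above: the model cannot tell which words truly signal sentiment unless the training set shows it.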

A training set
You come up with features

For sentiment

Vectorized attributes
Things that can impact the process
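"Vectorized" here means turning each text into a row of numbers, one column per known word. A minimal pure-Python sketch with two made-up sentences. scikit-learn's CountVectorizer, imported further down, builds the same kind of word-count matrix, with its own tokenization rules:

```python
texts = ["the movie was good", "the movie was bad bad"]

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocabulary = sorted({word for text in texts for word in text.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def vectorize(text):
    """Count how often each vocabulary word appears in the text."""
    vector = [0] * len(vocabulary)
    for word in text.split():
        if word in index:  # words outside the vocabulary are dropped
            vector[index[word]] += 1
    return vector

print(vocabulary)
for text in texts:
    print(vectorize(text))
```

Choices such as lowercasing, dropping common words, or lemmatizing change the vocabulary, and with it every vector.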

WordNetLemmatizer in NLTK:

Natural language processing

NLTKBook with Naive Bayes example:

Training sets for sentiment analysis

Movies, Yelp


NLTK stuff

import nltk
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus.reader.wordnet import NOUN


Gender identification

Try to guess what makes a name male or female (without the use of SSA statistics)


from nltk.corpus import names

# try

## come up with another feature
def gender_features(word):
   return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

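# make a training set
import random
# shuffle first; otherwise the first 500 names (the test set) are all male
random.shuffle(labeled_names)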

# train
# for every name, run the gender_features function
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# split it into training, testing
# remember that everything is labeled
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)


## test accuracy
print(nltk.classify.accuracy(classifier, test_set))

## most informative features
classifier.show_most_informative_features(5)


###### A bad guess: first letter of a name

def dumb_features(word):
   return {'first_letter': word[0]}

dumbfeaturesets = [(dumb_features(n), gender) for (n, gender) in labeled_names]

dumb_train_set, dumb_test_set = dumbfeaturesets[500:], dumbfeaturesets[:500]

dumbclassifier = nltk.NaiveBayesClassifier.train(dumb_train_set)

# classify accuracy
print(nltk.classify.accuracy(dumbclassifier, dumb_test_set))


## try last two letters

def better_features(word):
  return {'last_two_letters': word[-2:]}

better_featuresets = [(better_features(n), gender) for (n, gender) in labeled_names]

better_train_set, better_test_set = better_featuresets[500:], better_featuresets[:500]

better_classifier = nltk.NaiveBayesClassifier.train(better_train_set)


print(nltk.classify.accuracy(better_classifier, better_test_set))


Output of better features:

Most Informative Features
        last_two_letters = u'na'          female : male   =    101.5 : 1.0
        last_two_letters = u'la'          female : male   =     77.2 : 1.0
        last_two_letters = u'ia'          female : male   =     40.4 : 1.0
        last_two_letters = u'sa'          female : male   =     33.8 : 1.0
        last_two_letters = u'us'            male : female =     27.5 : 1.0
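
How to read those ratios: each line compares how likely a feature value is under one label versus the other, so "na" is roughly 100 times more likely on a female name than on a male one. A rough sketch of that calculation on a made-up mini training set. The six names and the smoothing constant of 0.5 are assumptions of this sketch, so the numbers will not match NLTK's output above:

```python
from collections import Counter

# Hypothetical mini training set; the real one comes from nltk.corpus.names.
labeled = [("Anna", "female"), ("Dana", "female"), ("Lena", "female"),
           ("Marcus", "male"), ("Titus", "male"), ("Jonah", "male")]

def last_two(name):
    return name[-2:].lower()

feature_counts = {"female": Counter(), "male": Counter()}
label_counts = Counter()
for name, label in labeled:
    feature_counts[label][last_two(name)] += 1
    label_counts[label] += 1

# All distinct feature values seen under either label.
values = set(feature_counts["female"]) | set(feature_counts["male"])

def likelihood(value, label, alpha=0.5):
    """Smoothed P(feature value | label); alpha avoids dividing by zero."""
    return (feature_counts[label][value] + alpha) / (label_counts[label] + alpha * len(values))

def ratio(value):
    """How many times more likely the value is for female than for male names."""
    return likelihood(value, "female") / likelihood(value, "male")

print(f"last_two_letters = 'na'   female : male = {ratio('na'):.1f} : 1.0")
```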