Week 8

Lesson plan:

Schock

The best ideas come from travel

https://instagram.com/aaronschock/

http://bigstory.ap.org/article/e2f1f52c3eb34caca7d74e5bf90f27f9/lawmaker-lavish-decor-billed-private-planes-concerts

# http://www.compciv.org/recipes/data/collecting-your-instagram-media-data/
cat ~/data/aaronschock-instagram/*.json | jq '.data[] .caption'

Download sunlight foundation data

mkdir -p /tmp/sunlight
cd /tmp/sunlight
curl -s http://sunlightfoundation.com/tools/expenditures/ | pup 'a attr{href}' | grep 'csv' | xargs wget

# look for travel
cat *.csv | grep SCHOCK | grep TRAV

Spotify

http://www.compciv.org/recipes/data/touring-the-spotify-api/

How Liberal/Conservative is a given Twitter account?

https://docs.google.com/spreadsheets/d/1cjRJyrPYj8KAhUrot8ubPOWgbAwm0OqIgPtsswJ2sjM/edit

Find a labeled set

People you know who are Republican/Democrat

Find a feature

What makes someone a liberal or a conservative? To gauge that you need to do research, or read what that person says.

Sometimes, you look at what other people say about that person. If a lot of people whom you know to be Republican say they like someone, then that someone is probably a Republican.

How to quantify this for a computer?

Given a list of Repub./Demo. accounts, see who they follow. If more R's follow than D's, then that account may be considered a "Republican" account.
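
That scoring rule can be sketched in plain Python. The counts and the account below are invented for illustration; the real data comes from the shell pipeline that follows:

```python
def party_lean(dem_followers, rep_followers):
    """Label an account by comparing how many known-Democrat vs.
    known-Republican accounts follow it."""
    if rep_followers > dem_followers:
        return "Republican"
    elif dem_followers > rep_followers:
        return "Democrat"
    else:
        return "Unknown"

# e.g. a hypothetical account followed by 40 Republicans and 5 Democrats
print(party_lean(dem_followers=5, rep_followers=40))  # Republican
```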

Programmatically:

Programming the political-tracking

This is based on this guide: Find out who Congress follows on Twitter using the command line

Get the legislators
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Filter for current legislators
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv

Get all the friends of any given legislator

username=DarrellIssa
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    # exit the loop if there's an error
    next_cursor=0
    echo "errors: $(echo "$json" | jq '.errors[0] .message')"
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done

A folder full of friends

Let's assume we have a directory full of friend_id files:

  data-hold/
    congress-tweets/
       friend_ids/
         |__aaronschock.txt
         |__andercrenshaw.txt

To find total number of followed accounts:

data_dir="data-hold/congress-tweets/friend_ids"

cat $data_dir/*.txt | wc -l

Which members of Congress follow you? First, get your ID:

id=$(t whois --csv dancow | csvfix order -smq -f 1 | tail -n 1)
cat $data_dir/*.txt | grep $id | wc -l
grep -l $id $data_dir/*.txt

Get all accounts followed by Democratic legislators

data_dir="data-hold/congress-tweets/friend_ids"
grep -f <(csvfix find -f 7 -s D < current-legislators.csv |
  csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls $data_dir/*.txt) |
  xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv


And the same for Republican legislators:

data_dir="data-hold/congress-tweets/friend_ids"
grep -f <(csvfix find -f 7 -s R < current-legislators.csv |
  csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls $data_dir/*.txt) |
  xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv

Create a spreadsheet that joins the two lists

echo "ID,democrat_friends,republican_friends" > /tmp/friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv |
  csvfix find -smq -if '($2 + $3) > 10' >> /tmp/friends_by_party.csv
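
The csvfix join above can be sketched in plain Python, assuming each party's data is a list of (id, count) pairs rather than the real files. The sample rows here are invented; the `> 10` filter mirrors the csvfix condition above:

```python
def join_by_party(dem_rows, rep_rows, min_total=10):
    """Inner join: keep only IDs followed by BOTH parties' legislators,
    requiring a minimum combined follow count."""
    rep = dict(rep_rows)
    joined = []
    for acct_id, d_count in dem_rows:
        if acct_id in rep:
            r_count = rep[acct_id]
            if d_count + r_count > min_total:
                joined.append((acct_id, d_count, r_count))
    return joined

dems = [("123", 90), ("456", 2), ("789", 55)]
reps = [("123", 85), ("456", 3)]
print(join_by_party(dems, reps))  # [('123', 90, 85)]
```

IDs followed by only one party's members, or by too few members overall, are dropped, which keeps the spreadsheet to accounts where the comparison is meaningful.
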
Now get the friends of some other account, e.g. @mmflint:

username=mmflint
myfile1="/tmp/$username-friends.csv"
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    # exit the loop if there's an error
    next_cursor=0
    echo "errors: $(echo "$json" | jq '.errors[0] .message')"
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done > $myfile1

#### Find my friends

myfile2="/tmp/$username-party-friends.csv"
echo "ID,democrats,republicans" > $myfile2
csvfix join -f 1:1 $myfile1 /tmp/friends_by_party.csv >> $myfile2

open -a "/Applications/Microsoft Office 2011/Microsoft Excel.app" $myfile2

Note that there are a lot of things we don't care about:

Our model is very fast. It's fairly easy to explain. But we sacrifice nuance and complexity.

Sentiment calculator

http://www.sentiment140.com/

http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/

Examples of sentiment analysis in code

http://nbviewer.ipython.org/github/tokestermw/twitter-bart/blob/master/ipynb/Twitter140-checkNB.ipynb

Nice slides: http://vumaasha.github.io/pychennai-sentiment-analysis/#/imagination

Bayesian theory

An intuitive formula for predicting an outcome based on past evidence. The key question: how good is the past evidence?

Extremely long explanation

Explanation excerpt

Explanation with legos

Cancer example

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

via yudkowsky
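
Bayes' rule — P(cancer | positive) = P(positive | cancer) × P(cancer) / P(positive) — turns those numbers into an answer of only about 7.8%, which surprises most people. Worked out in Python:

```python
# Numbers from the Yudkowsky example above
p_cancer = 0.01              # prior: 1% of screened women have breast cancer
p_pos_given_cancer = 0.80    # true-positive rate of the mammography
p_pos_given_healthy = 0.096  # false-positive rate

# Total probability of a positive mammography (law of total probability)
p_pos = (p_pos_given_cancer * p_cancer +
         p_pos_given_healthy * (1 - p_cancer))

# Bayes' rule
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.078
```

The false positives among the 99% healthy women swamp the true positives among the 1% with cancer, so a positive test still leaves the probability low.
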

Bayesian in application

The problem is that not all things can be as easily quantified as incidence of cancer and results of cancer screening. With sentiment analysis, you have to decide which things truly indicate positive/negative, and you don't know how many such factors there may be.

A training set
You come up with features

For sentiment

Vectorized attributes
Things that can impact the process
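
To make the training-set-plus-features idea concrete, here is a tiny from-scratch Naive Bayes sentiment sketch — word-presence counts with add-one (Laplace) smoothing. The five training "documents" are invented; a real training set (movie reviews, tweets) would have thousands:

```python
from collections import Counter, defaultdict
import math

# Tiny invented training set of labeled documents
train = [
    ("loved great fun", "pos"),
    ("great acting loved it", "pos"),
    ("what a great film", "pos"),
    ("boring terrible plot", "neg"),
    ("terrible waste boring", "neg"),
]

# Count how often each word appears under each label
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the label with the highest log-probability; add-one
    smoothing keeps unseen words from zeroing out a label."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("great fun"))        # pos
print(classify("boring terrible"))  # neg
```

This is the same math NLTK's NaiveBayesClassifier does in the gender example later, just visible.
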

WordNetLemmatizer in NLTK:

nltk.stem.WordNetLemmatizer().lemmatize('alumni')

Natural language processing

NLTKBook with Naive Bayes example:

Training sets for sentiment analysis

Movie reviews; Yelp

http://www.cs.cornell.edu/people/pabo/movie-review-data/

NLTK stuff

import nltk
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus.reader.wordnet import NOUN

nltk.stem.WordNetLemmatizer().lemmatize('loves')
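
CountVectorizer is imported above but never used; here is roughly what it does, sketched with the stdlib so the idea is visible (sklearn's real API differs in detail — it returns a sparse matrix, handles tokenization, etc.):

```python
from collections import Counter

def fit_vocabulary(docs):
    """Build a sorted vocabulary mapping word -> column index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """One count-vector per document, columns ordered by vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in sorted(vocab)])
    return rows

docs = ["the movie was great", "the movie was terrible terrible"]
vocab = fit_vocabulary(docs)
print(sorted(vocab))  # ['great', 'movie', 'terrible', 'the', 'was']
print(transform(docs, vocab))
# [[1, 1, 0, 1, 1], [0, 1, 2, 1, 1]]
```

Each document becomes a fixed-length vector of word counts — the "vectorized attributes" a classifier can actually consume.
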

Gender identification

Try to guess what makes a name male or female (without the use of SSA statistics)

http://www.nltk.org/book/ch06.html

from nltk.corpus import names

# try
names.words('male.txt')


## come up with a feature
def gender_features(word):
   return {'last_letter': word[-1]}


labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])


# make a training set
import random
random.shuffle(labeled_names)


# train
# for every name, run the gender_features function
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# split it into training, testing
# remember that everything is labeled
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

classifier.classify(gender_features("Trinity"))

## test accuracy
print(nltk.classify.accuracy(classifier, test_set))

## most informative features

classifier.show_most_informative_features(5)


###### A bad guess: first letter of a name

def dumb_features(word):
   return {'first_letter': word[0]}

dumbfeaturesets = [(dumb_features(n), gender) for (n, gender) in labeled_names]

dumb_train_set, dumb_test_set = dumbfeaturesets[500:], dumbfeaturesets[:500]

dumbclassifier = nltk.NaiveBayesClassifier.train(dumb_train_set)

dumbclassifier.classify(dumb_features("Daniel"))
# classify accuracy
print(nltk.classify.accuracy(dumbclassifier, dumb_test_set))

dumbclassifier.show_most_informative_features(5)



## try last two letters

def better_features(word):
  return {'last_two_letters': word[-2:]}

better_featuresets = [(better_features(n), gender) for (n, gender) in labeled_names]

better_train_set, better_test_set = better_featuresets[500:], better_featuresets[:500]

better_classifier = nltk.NaiveBayesClassifier.train(better_train_set)

better_classifier.classify(better_features("Daniel"))

print(nltk.classify.accuracy(better_classifier, better_test_set))

better_classifier.show_most_informative_features(5)

Output of better features:

Most Informative Features
        last_two_letters = u'na'          female : male   =    101.5 : 1.0
        last_two_letters = u'la'          female : male   =     77.2 : 1.0
        last_two_letters = u'ia'          female : male   =     40.4 : 1.0
        last_two_letters = u'sa'          female : male   =     33.8 : 1.0
        last_two_letters = u'us'            male : female =     27.5 : 1.0