The best ideas come from travel
https://instagram.com/aaronschock/
http://bigstory.ap.org/article/e2f1f52c3eb34caca7d74e5bf90f27f9/lawmaker-lavish-decor-billed-private-planes-concerts
# http://www.compciv.org/recipes/data/collecting-your-instagram-media-data/
cat ~/data/aaronschock-instagram/*.json | jq '.data[] .caption'
Download Sunlight Foundation expenditure data
mkdir -p /tmp/sunlight
cd /tmp/sunlight
curl -s http://sunlightfoundation.com/tools/expenditures/ | pup 'a attr{href}' | grep 'csv' | xargs wget
# look for travel
cat *.csv | grep SCHOCK | grep TRAV
http://www.compciv.org/recipes/data/touring-the-spotify-api/
https://docs.google.com/spreadsheets/d/1cjRJyrPYj8KAhUrot8ubPOWgbAwm0OqIgPtsswJ2sjM/edit
People you know who are Republican/Democrat
What makes someone a liberal or a conservative? To gauge that you need to do research, or read what that person says.
Sometimes you look at what other people say about that person: if a lot of people whom you know to be Republicans say they like someone, then that someone is probably a Republican.
How to quantify this for a computer?
Given a list of Repub./Demo. accounts, see who they follow. If more R's follow than D's, then that account may be considered a "Republican" account.
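That decision rule is easy to sketch in code. A minimal Python version, with invented follower counts for illustration:

```python
# Hypothetical counts of how many known-Republican and known-Democrat
# accounts follow each account of interest (all numbers made up).
followers_by_party = {
    "account_a": {"R": 40, "D": 5},
    "account_b": {"R": 3, "D": 28},
    "account_c": {"R": 10, "D": 10},
}

def guess_party(counts):
    """Label an account by which party's known accounts follow it more."""
    if counts["R"] > counts["D"]:
        return "Republican"
    if counts["D"] > counts["R"]:
        return "Democrat"
    return "unknown"

for name, counts in followers_by_party.items():
    print(name, guess_party(counts))
```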
Programmatically:
This is based on the guide: "Find out who Congress follows on Twitter using the command line"
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv
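Per the csvfix flags above, field 10 of legislators.csv is matched against 1 (presumably the in_office flag). The same filter in plain Python, over a made-up two-row sample:

```python
import csv
import io

# A made-up two-row sample; in the real legislators.csv the 10th
# column is assumed to be the in_office flag.
sample = """lastname,firstname,c3,c4,c5,c6,c7,c8,c9,in_office
Doe,Jane,x,x,x,x,x,x,x,1
Roe,Rick,x,x,x,x,x,x,x,0
"""

rows = list(csv.reader(io.StringIO(sample)))
header, body = rows[0], rows[1:]
# keep rows whose 10th field (zero-indexed 9) equals "1"
current = [r for r in body if r[9] == "1"]
print(current)
```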
Get all the friends of any given legislator
username=DarrellIssa
next_cursor=-1
while [[ $next_cursor != "0" && $next_cursor != "" && $next_cursor != "null" ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    # just exit the loop if there's an error
    next_cursor=0
    echo "errors: $(echo "$json" | jq '.errors[0] .message')"
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done
Let's assume we have a directory full of friend_id files:
data-hold/
congress-tweets/
friend_ids/
|__aaronschock.txt
|__andercrenshaw.txt
To find total number of followed accounts:
data_dir="data-hold/congress-tweets/friend_ids"
cat $data_dir/*.txt | wc -l
Who follows you? First, get your ID:
id=$(t whois --csv dancow | csvfix order -smq -f 1 | tail -n 1)
cat $data_dir/*.txt | grep $id | wc -l
grep -l $id $data_dir/*.txt
data_dir="data-hold/congress-tweets/friend_ids"
grep -f <(csvfix find -f 7 -s D < current-legislators.csv |
csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
<(ls $data_dir/*.txt) |
xargs cat | sort | uniq -c | sort -rn |
sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv
data_dir="data-hold/congress-tweets/friend_ids"
grep -f <(csvfix find -f 7 -s R < current-legislators.csv |
csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
<(ls $data_dir/*.txt) |
xargs cat | sort | uniq -c | sort -rn |
sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv
echo "ID,democrat_friends,republican_friends" > /tmp/friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv |
csvfix find -smq -if '($2 + $3) > 10' >> /tmp/friends_by_party.csv
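In effect, those two pipelines count, for each followed account ID, how many Democratic legislators and how many Republican legislators follow it. A Python sketch of the same tally, using invented friend lists in place of the data-hold files:

```python
from collections import Counter

# Made-up friend lists: legislator -> set of account IDs they follow.
friend_ids = {
    "dem1": {"101", "102"},
    "dem2": {"101", "103"},
    "rep1": {"101", "104"},
}
party = {"dem1": "D", "dem2": "D", "rep1": "R"}

dem_counts, rep_counts = Counter(), Counter()
for legislator, ids in friend_ids.items():
    target = dem_counts if party[legislator] == "D" else rep_counts
    target.update(ids)

# Account 101 is followed by two Democrats and one Republican.
print(dem_counts["101"], rep_counts["101"])
```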
username=mmflint
myfile1="/tmp/$username-friends.csv"
next_cursor=-1
while [[ $next_cursor != "0" && $next_cursor != "" && $next_cursor != "null" ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    # just exit the loop if there's an error
    next_cursor=0
    echo "errors: $(echo "$json" | jq '.errors[0] .message')"
  else
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done > "$myfile1"
#### Match my friends against the party counts
myfile2="/tmp/$username-party-friends.csv"
echo "ID,democrats,republicans" > $myfile2
csvfix join -f 1:1 $myfile1 /tmp/friends_by_party.csv >> $myfile2
open -a "/Applications/Microsoft Office 2011/Microsoft Excel.app" "$myfile2"
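Instead of eyeballing the spreadsheet, you could total the party counts across all of a user's friends to get a crude lean. A sketch, with an invented sample shaped like the joined CSV above:

```python
import csv
import io

# A made-up sample shaped like /tmp/mmflint-party-friends.csv:
# friend ID, how many Democrats follow it, how many Republicans follow it.
sample = """ID,democrats,republicans
101,45,3
102,2,38
103,20,19
"""

dem_total = rep_total = 0
for row in csv.DictReader(io.StringIO(sample)):
    dem_total += int(row["democrats"])
    rep_total += int(row["republicans"])

# A crude "lean" for this user's friend list.
print(dem_total, rep_total)
```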
Note that there are a lot of things this model doesn't account for. It's very fast and fairly easy to explain, but we sacrifice nuance and complexity.
http://www.sentiment140.com/ http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
http://nbviewer.ipython.org/github/tokestermw/twitter-bart/blob/master/ipynb/Twitter140-checkNB.ipynb
Nice slides: http://vumaasha.github.io/pychennai-sentiment-analysis/#/imagination
Bayes' theorem is an intuitive formula for predicting an outcome based on past evidence. The key question: how good is the past evidence?
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
(via Yudkowsky)
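Working the mammography numbers through Bayes' theorem, P(cancer | positive) = P(positive | cancer) · P(cancer) / P(positive):

```python
p_cancer = 0.01           # base rate among forty-year-old women screened
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

# Total probability of a positive mammography (law of total probability)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' theorem
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # ~0.078, i.e. about 7.8%
```

Most people guess far higher than 7.8%, because the low base rate gets ignored.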
The problem is that not all things can be as easily quantified as incidence of cancer and results of cancer screening. With sentiment analysis, you have to decide which things truly indicate positive/negative, and you don't know how many such factors there may be.
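To make the tradeoff concrete, here is a toy hand-rolled Naive Bayes sentiment classifier over a few invented labeled phrases, using word presence as the only feature. The training phrases and the add-one smoothing are assumptions of the sketch, not a real sentiment lexicon:

```python
from collections import Counter

# Tiny made-up training set: (text, label)
train = [
    ("i love this great movie", "pos"),
    ("what a great wonderful film", "pos"),
    ("i hate this terrible movie", "neg"),
    ("what an awful terrible film", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def score(text, label):
    """Naive Bayes score: P(label) times P(word|label) per word,
    with add-one smoothing over the combined vocabulary."""
    vocab = set(w for c in word_counts.values() for w in c)
    p = label_counts[label] / sum(label_counts.values())
    total = sum(word_counts[label].values())
    for w in text.split():
        p *= (word_counts[label][w] + 1) / (total + len(vocab))
    return p

def classify(text):
    return max(("pos", "neg"), key=lambda lab: score(text, lab))

print(classify("a great film"))
print(classify("an awful movie"))
```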
For sentiment features, you often want to normalize word variants to a base form first. The WordNetLemmatizer in NLTK:
import nltk
nltk.stem.WordNetLemmatizer().lemmatize('alumni')
NLTKBook with Naive Bayes example:
http://www.cs.cornell.edu/people/pabo/movie-review-data/
import nltk
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus.reader.wordnet import NOUN
nltk.stem.WordNetLemmatizer().lemmatize('loves')
Try to guess what makes a name male or female (without the use of SSA statistics)
http://www.nltk.org/book/ch06.html
from nltk.corpus import names
# try
names.words('male.txt')
## come up with a feature
def gender_features(word):
    return {'last_letter': word[-1]}
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
# make a training set
import random
random.shuffle(labeled_names)
# train
# for every name, run the gender_features function
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# split it into training, testing
# remember that everything is labeled
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.classify(gender_features("Trinity"))
## test accuracy
print(nltk.classify.accuracy(classifier, test_set))
## most informative features
classifier.show_most_informative_features(5)
###### A bad guess: first letter of a name
def dumb_features(word):
    return {'first_letter': word[0]}
dumbfeaturesets = [(dumb_features(n), gender) for (n, gender) in labeled_names]
dumb_train_set, dumb_test_set = dumbfeaturesets[500:], dumbfeaturesets[:500]
dumbclassifier = nltk.NaiveBayesClassifier.train(dumb_train_set)
dumbclassifier.classify(dumb_features("Daniel"))
# classify accuracy
print(nltk.classify.accuracy(dumbclassifier, dumb_test_set))
dumbclassifier.show_most_informative_features(5)
## try last two letters
def better_features(word):
    return {'last_two_letters': word[-2:]}
better_featuresets = [(better_features(n), gender) for (n, gender) in labeled_names]
better_train_set, better_test_set = better_featuresets[500:], better_featuresets[:500]
better_classifier = nltk.NaiveBayesClassifier.train(better_train_set)
better_classifier.classify(better_features("Daniel"))
print(nltk.classify.accuracy(better_classifier, better_test_set))
better_classifier.show_most_informative_features(5)
Output of better features:
Most Informative Features
last_two_letters = u'na' female : male = 101.5 : 1.0
last_two_letters = u'la' female : male = 77.2 : 1.0
last_two_letters = u'ia' female : male = 40.4 : 1.0
last_two_letters = u'sa' female : male = 33.8 : 1.0
last_two_letters = u'us' male : female = 27.5 : 1.0