Week 9

Intro to machine learning

Machine learning is the science of getting computers to act without being explicitly programmed.

Stanford Machine Learning on Coursera. The on-demand version

The tradeoff between speed, cost, and accuracy

In sentiment analysis of English-language sentences, state-of-the-art accuracy in machine learning ranges from 81 to 85%. The average adult could correctly classify such sentences at least 9 times out of 10. An "algorithm" could even consist of breaking up the text and assigning it to humans via Mechanical Turk, and that might be the right solution if speed and cost aren't your main concerns.

In 2009, Netflix awarded $1 million to a team whose algorithm improved the prediction of user movie ratings by just over 10%, yet Netflix never put the winning solution into production.

In a world in which resources are still finite, and barring an unforeseen breakthrough, there is no 100%-always-right computational technique for machine learning.

The importance of data to machine learning

Algorithms are one part of machine learning effectiveness; another (huge) part is the data used to "train" and evaluate the algorithms. A blunter version of the same idea: garbage in, garbage out.

Are webcams racist?

HP Responds to Claim of 'Racist' Webcams

HP looking into claim webcams can't see black people

"The technology we use is built on standard algorithms that measure the difference in intensity of contrast between the eyes and the upper cheek and nose," wrote Tony Welch, the lead social media strategist for HP's Personal Systems Group. "We believe that the camera might have difficulty 'seeing' contrast in conditions where there is insufficient foreground lighting."

You can revisit the face-boxer script and tweak it, or run it against a screengrab of the original YouTube video.

img

Don't accept "it's just science and math" as the final answer. The scientific and mathematical principles may be objective, but because computational efficiency is a factor, the balance between speed and accuracy, and therefore what the machine "sees," is very much a human decision. Theoretically, a classifier that accounts for eyeglasses requires some extra data and complexity, and yet it's rare to hear someone defend a classifier with, "Well, just take off your glasses first."

Via BuzzFeed: Teaching The Camera To See My Skin - "Navigating photography’s inherited bias against dark skin," by Syreeta McFadden

It turns out, film stock’s failures to capture dark skin aren’t a technical issue, they’re a choice. Lorna Roth, a scholar in media and communication studies, wrote that film emulsions — the coating on the film base that reacts with chemicals and light to produce an image — “could have been designed initially with more sensitivity to the continuum of yellow, brown and reddish skin tones but the design process would have to be motivated by a recognition of the need for extended range.” Back then there was little motivation to acknowledge, let alone cater to a market beyond white consumers.

Kodak did finally modify its film emulsion stocks in the 1970s and ’80s — but only after complaints from companies trying to advertise chocolate and wood furniture. The result was the Gold Max film stock. According to Roth, a Kodak executive described the film as being able to “photograph the details of the dark horse in low light.”

Kodak never encountered a groundswell of complaints from African-Americans about their products. Many of us simply assumed the deficiencies of film emulsion performance reflected our inadequacies as photographers. Perhaps we didn’t understand the principles of photography. It is science, after all.

…If you’re modeling light settings and defining the meter readings about a balanced image against white skin, the contours and shape of a white face, you’ve immediately erased 70% of the world’s population. It wasn’t until the mid-1990s that the calibration model for color reference models fully shifted away from Shirley to be inclusive of a full range of skin tones.

Revisiting Sentiment140

img

Read the technical paper by Stanford students, Alec Go, Richa Bhayani, and Lei Huang: Twitter Sentiment Classification using Distant Supervision

Emoticons for training data

After post-processing the data, we take the first 800,000 tweets with positive emoticons, and 800,000 tweets with negative emoticons, for a total of 1,600,000 training tweets. The test data is manually collected, using the web application.

A set of 177 negative tweets and 182 positive tweets were manually marked. Not all the test data has emoticons. We use the following process to collect test data:

  1. We search the Twitter API with specific queries. These queries are arbitrarily chosen from different domains. For example, these queries consist of consumer products (40d, 50d, kindle2), companies (aig, at&t), and people (Bobby Flay, Warren Buffet). The query terms we used are listed in Table 4. The different categories of these queries are listed in Table 5.

  2. We look at the result set for a query. If we see a result that contains a sentiment, we mark it as positive or negative. Thus, this test set is selected independently of the presence of emoticons.

Keep in mind that using Twitter data to make judgments about the world-in-general is problematic:

The problems with using Twitter as a model for the general population are simple. You don’t have to be a pollster to understand that searching for tweets that match some keywords hardly constitutes proper probabilistic sampling. We might display a map that shows colors mentioned by Americans on Twitter, but nobody would say this is an accurate map of favorite colors for each region of the USA. Naturally, most graphics play it safe and say overtly that they are only representations of Twitter and are not meant to provide deeper insight beyond that into the general population.

However, if we're using Twitter data to train a classifier focused solely on classifying tweets, then there shouldn't be a problem. The Sentiment140 paper was written in 2009, though; the accepted forms of Tweet-speak (and the userbase) may have changed drastically enough that the classifier would now score significantly lower than the 83% peak claimed by Sentiment140's creators.

Bayes Theorem

At Google: "All your Bayes are belong to us"

GoogleTechTalks: Peter Norvig, Past, Present, Future Vision of AI - Google and AAAI 2011

Googler: "Google uses Bayesian filtering the way [my former employer] uses the if statement"

“And it was fun looking at the comments, because you’d see things like ‘well, I’m throwing in this naive Bayes now, but I’m gonna come back and fix it up and come up with something better later.’ And the comment would be from 2006. [laughter] And I think what that says is, when you have enough data, sometimes, you don’t have to be too clever about coming up with the best algorithm.”

"What is it that makes Google unique?"

  1. Data - "We got a lot of data…And we do complicated things…[but] you can fall back on, 'Everything I ever needed to know, I learned from Sesame Street'…basically all we're doing is counting."
  2. Working at scale - "I talked to some of the SRE engineers, and they said, 'Pst, We'd never run down to a petabyte, that could never happen'"
  3. Control - It's not just that there's data out there that we're observing, it's that we get to interact with the world…we get to do experiments, and say, 'if we take this intervention, how is that going to change things?'

Bayesian statistics

img

An Intuitive Explanation of Bayes’ Theorem

"An Intuitive (and Short) Explanation of Bayes’ Theorem"

Explanation with Legos

via Eliezer Yudkowsky:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening.

What is the probability that she actually has breast cancer?
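
To make the arithmetic concrete, here is a quick back-of-the-envelope check, simply plugging the numbers above into Bayes' Theorem (the variable names are just for illustration):

# numbers from the mammography example above
p_cancer = 0.01              # 1% of women in this group have breast cancer
p_pos_given_cancer = 0.80    # 80% of women with cancer get a positive mammography
p_pos_given_healthy = 0.096  # 9.6% of women without cancer also get a positive

# total probability of a positive mammography
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy

# Bayes' Theorem: P(cancer | positive) = P(cancer) * P(positive | cancer) / P(positive)
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # => 0.078

Despite the positive test, the probability is only about 7.8%, because the 1% base rate of cancer dominates the calculation.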

Simple example:

Out of a total of 100 work days, an employee:

  - calls in absent (A) on 10 days
  - has bourbon (B) on 80 days
  - had bourbon on 7 of the 10 absent days

What is P(A|B), the chance that the employee will be calling in absent, given that bourbon is had today?

Or another way to put it: when the boss tells the employee that he'll be fired if he misses another day, the employee makes sure to have bourbon the very next morning and every morning afterwards.

What does this look like in Bayes?

P(A) * P(B|A)      (10/100) * (7/10)      7
-------------  =  -------------------  = ----  =  8.75%
     P(B)                80/100           80

Naive Bayes

Bayes' Theorem is particularly useful in computational processes. One common implementation is the naive Bayes classifier, in which the conditional probabilities of the individual features are simply multiplied together as if they were independent of each other, regardless of whether they actually are.
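
As a rough sketch of the scoring step (made-up numbers and names, not any particular library's API): each candidate label gets its prior probability multiplied by the conditional probability of every observed feature, and the label with the higher score wins.

# a minimal, hypothetical naive Bayes scoring sketch with made-up probabilities
priors = {'positive': 0.5, 'negative': 0.5}
likelihoods = {
    'positive': {'love': 0.10, 'broken': 0.01},
    'negative': {'love': 0.02, 'broken': 0.08},
}

def naive_bayes_score(features, label):
    score = priors[label]
    for f in features:
        # "naively" multiply the conditional probabilities together,
        # regardless of whether the features are actually independent
        score *= likelihoods[label][f]
    return score

for label in ('positive', 'negative'):
    print(label, naive_bayes_score(['love', 'broken'], label))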

Advantages of naive Bayes:

Senators example

Given the current political makeup of the U.S. Senate, and some (made-up) assessments of their stances on abortion and the death penalty, how can naive Bayes be used to predict the political party of an incoming Senator?

Note: this isn't a great real-world scenario; the professed party of a U.S. Senator is always known, but I'm keeping it simple here. A better use would be: knowing senators' public stances on some issues (gun rights, immigration, abortion) and other characteristics (age, region, voting pattern), use Bayes to guesstimate their stance on broader issues (climate change, taxes, foreign policy).

Senators on the basic issues

Senators who Snapchat

In situations in which something completely new is encountered, the rare event can skew predictions wildly. Here, we have a situation in which the relatively hip Independent Senators are on record as using Snapchat, but no sitting Republican or Democratic Senators have yet used it.

To a less flexible algorithm, this might be interpreted as "Past Republican and Democratic Senators absolutely do not like Snapchat." In reality, it may simply be that sitting Senators have missed the prime age for trying Snapchat, because Snapchat is so new. But the way naive Bayes is currently set up, the model can never account for that: the count of 0 Republican/Democratic Snapchat users zeroes out any chance that a Republican or Democrat will some day be a Snapchatter.

So how do we encode such a seemingly simple concept?

Senators who Snapchat smoothed

Additive smoothing, in which the numerator (and the denominator, to keep things consistent) is incremented, adds a little statistical noise but gives an "out" to the predictive model.
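
Here is a minimal sketch of that idea, using hypothetical counts in the spirit of the Snapchat example (not the actual table): the raw estimate for Republicans and Democrats is exactly zero, which would zero out the whole naive Bayes product, while adding 1 to each count leaves a small but nonzero probability.

# hypothetical counts, for illustration only
snapchat_users = {'Republican': 0, 'Democrat': 0, 'Independent': 2}
total_senators = {'Republican': 54, 'Democrat': 44, 'Independent': 2}

def raw_probability(party):
    return snapchat_users[party] / total_senators[party]

def smoothed_probability(party, alpha=1):
    # additive (Laplace) smoothing: add alpha to the count of Snapchat users,
    # and alpha for each possible outcome (uses / doesn't use) to the total
    return (snapchat_users[party] + alpha) / (total_senators[party] + 2 * alpha)

for party in ('Republican', 'Democrat', 'Independent'):
    print(party, raw_probability(party), round(smoothed_probability(party), 3))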

NLTK

Gender identification via Bayes

Train a classifier to guess the gender of a name without the use of Social Security Administration statistics. While the statistical method is likely to be more accurate, this kind of classifier could work for specialized use cases, such as making up a totally new name and estimating whether it seems more "male" or "female".

import nltk
from nltk.corpus import names
import random

# the names corpus may require a one-time download: nltk.download('names')
m_names = names.words('male.txt')
f_names = names.words('female.txt')
labeled_names = ([(n, 'male') for n in m_names] +
  [(n, 'female') for n in f_names])

# shuffle them up
random.shuffle(labeled_names)

# first run
def dumb_features(word) :
  return { 'first': word[0] }

sets = [(dumb_features(n), sex) for (n, sex) in labeled_names]
train_dumb, test_dumb = sets[1000:], sets[:1000]
dumb_classifier = nltk.NaiveBayesClassifier.train(train_dumb)

dumb_classifier.classify(dumb_features('Daniel'))
dumb_classifier.classify(dumb_features('Bilbo'))
dumb_classifier.classify(dumb_features('Jack'))
# see the results
print(nltk.classify.accuracy(dumb_classifier, test_dumb))
dumb_classifier.show_most_informative_features(5)


## another classifier
def features_a(word) :
  return { 'last': word[-1] }
sets = [(features_a(n), sex) for (n, sex) in labeled_names]
train_a, test_a = sets[1000:], sets[:1000]
classifier_a = nltk.NaiveBayesClassifier.train(train_a)

classifier_a.classify(features_a('Daniel'))
# see the results
print(nltk.classify.accuracy(classifier_a, test_a))
classifier_a.show_most_informative_features(5)

#######################
def features_b(w) :
  word = w.lower()
  return { 'last': word[-1] ,
           'last2': word[-2:]
          }
sets = [(features_b(n), sex) for (n, sex) in labeled_names]
train_b, test_b = sets[1000:], sets[:1000]
classifier_b = nltk.NaiveBayesClassifier.train(train_b)

classifier_b.classify(features_b('Daniel'))
# see the results
print(nltk.classify.accuracy(classifier_b, test_b))
classifier_b.show_most_informative_features(5)

#######################
def features_c(w) :
  word = w.lower()
  return { 'last': word[-1] ,
           'last2': word[-2:],
           'first2': word[:2]
          }

sets = [(features_c(n), sex) for (n, sex) in labeled_names]
train_c, test_c = sets[1000:], sets[:1000]
classifier_c = nltk.NaiveBayesClassifier.train(train_c)

classifier_c.classify(features_c('Daniel'))
# see the results
print(nltk.classify.accuracy(classifier_c, test_c))
classifier_c.show_most_informative_features(5)
#######################

def features_d(w) :
  word = w.lower()
  vowel_count = word.count('a') + word.count('e') + word.count('i') + word.count('o') + word.count('u')
  return { 
            'first': word[0] ,
            'last': word[-1] ,
           'last2': word[-2:],
           'vowel_ratio': round(vowel_count * 100.0 / len(word))
          }

sets = [(features_d(n), sex) for (n, sex) in labeled_names]
train_d, test_d = sets[1000:], sets[:1000]
classifier_d = nltk.NaiveBayesClassifier.train(train_d)

classifier_d.classify(features_d('Daniel'))
# see the results
print(nltk.classify.accuracy(classifier_d, test_d))
classifier_d.show_most_informative_features(5)


Other NLTK

Twitter sentiment analysis using Python and NLTK - A great implementation by Laurent Luce.