Introduction to Basic Supervised Machine Learning using the Yelp Academic Dataset

These notes are in flux. Sorry.

(Some more notes on ngrams have been split off to a separate article. More to come.)

Machine learning is the science of getting computers to act without being explicitly programmed.

As you get better at programming, you might find yourself actually getting lazier. And this is the natural, nay, the preferred order of things. To quote from the glossary of Larry Wall's Programming Perl (Second Edition), regarding the virtues of Laziness and Impatience:

Laziness - The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer. Also hence, this book. See also impatience and hubris. (p.609)

Impatience - The anger you feel when the computer is being lazy. This makes you write programs that don't just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris. (p.608)

So even if you don't yet have the next billion-dollar startup idea, just being able to delegate a menial information task to a computer, whether it be web-scraping or grepping or auto-tweeting the news – this is motivation enough for programming.

At the ground level, we can consider machine learning as just the next level of laziness and impatience. A program usually has to wait on the human to tell it when and how to act. If we could teach the program to act, to know what to do without hand-holding, then we, the humans, are freed from the work of hand-holding.

Useful machine learning: sentiment analysis

Sentiment analysis is a pretty good machine learning problem:

  1. If you need to read hundreds/thousands/millions of messages, day after day, and quickly determine if they are happy or angry, then a program can most certainly have a profound impact on your own happiness/anger.
  2. It may not be important to know, within seconds, whether a given message is happy or angry. However, if you are processing many of these messages, then you effectively need to be able to filter each message within seconds.
  3. Maybe there are situations in which near-perfect accuracy is needed in labeling messages "happy" or "angry". But in most situations I can think of, sentiment analysis only has to act as a triage. That is, being able to cut the manual filtering of 1,000 messages down to 10 edge cases is a huge win. And there really isn't any non-computational alternative for achieving that triaging.
  4. Most importantly, it is not easy for a human to define what makes a message "happy" or "angry". While I can easily determine the sentiment of any given message – e.g. that neither "You have all the virtues I dislike and none of the vices I admire" nor "I hope we can be better strangers" is particularly positive – I'd have trouble explaining to another human, never mind a program, how to detect sarcasm or snark. And that's for things I've seen; how do I prepare a computer for all the backhanded compliments I've yet to experience?

The number of possible known unknowns and unknown unknowns in sentiment analysis is what makes it a ripe problem for machine learning. Automated decision-making may be far from perfect, but it's sure better than trying to enumerate all the possible scenarios by hand – and so we're back to the virtues of laziness and impatience.

(For instance, Stanford's NaSent system claims a state-of-the-art 85% success rate in classifying the sentiment of individual sentences.)

Examples of sentiment analysis

Bag of words

More than bag of words

Bag of words classifiers can work well in longer documents by relying on a few words with strong sentiment like ‘awesome’ or ‘exhilarating.’ However, sentiment accuracies even for binary positive/negative classification for single sentences has not exceeded 80% for several years. For the more difficult multiclass case including a neutral class, accuracy is often below 60% for short messages on Twitter (Wang et al., 2012).
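To make the idea of "relying on a few words with strong sentiment" concrete, here is a minimal sketch of a bag-of-words scorer. The word lists, the sample review, and the function names `bag_of_words` and `lexicon_score` are all made up for illustration; a real classifier would learn word weights from labeled data (such as Yelp reviews and their star ratings) rather than use a hand-picked lexicon.

```python
from collections import Counter

# Hypothetical, hand-picked lexicon for illustration only; a real classifier
# would learn word weights from labeled examples.
POSITIVE = {"awesome", "exhilarating", "great", "delicious"}
NEGATIVE = {"awful", "bland", "rude", "terrible"}

def bag_of_words(text):
    """Lowercase the text, split on whitespace, strip punctuation, count words."""
    words = (w.strip(".,!?;:\"'") for w in text.lower().split())
    return Counter(w for w in words if w)

def lexicon_score(text):
    """Return the positive-word count minus the negative-word count."""
    bag = bag_of_words(text)
    pos = sum(n for w, n in bag.items() if w in POSITIVE)
    neg = sum(n for w, n in bag.items() if w in NEGATIVE)
    return pos - neg

review = "The tacos were awesome and the salsa was delicious, but the waiter was rude."
print(lexicon_score(review))  # 2 positive words - 1 negative word = 1
```

A score above zero would be read as "happy" and below zero as "angry". Note that the bag is just a table of counts: nothing in it records which word came before which.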

From a linguistic or cognitive standpoint, ignoring word order in the treatment of a semantic task is not plausible, and, as we will show, it cannot accurately classify hard examples of negation. Correctly predicting these hard cases is necessary to further improve performance.
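The negation problem can be shown with a small sketch. The two sentences are made-up examples and the helper names `bag_of_words` and `bigrams` are hypothetical; this is not the method used in the quoted paper, only a demonstration of what a word-count model cannot see. Counting adjacent word pairs (bigrams) is one simple, partial way to recover some word order.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word, discarding word order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Count adjacent word pairs, which preserves some local word order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

# Made-up sentences with opposite sentiment but identical vocabulary.
a = "the food was good not bad"
b = "the food was bad not good"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = bigrams(a) == bigrams(b)

print(same_bag)      # True: a word-count model sees the same input twice
print(same_bigrams)  # False: ("was", "good") appears only in the first sentence
```

Any classifier that only sees the word counts must give both sentences the same label, so it is guaranteed to get at least one of them wrong.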

Stanford's NaSent (short for Neural Analysis of Sentiment) is an approach that considers the structure of an entire sentence. The Stanford parser was used to extract more than 215,000 unique phrases from a dataset of 11,855 single sentences. Each of these phrases was given an intensity rating, by humans, of positive or negative sentiment.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank - by Stanford University's Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts.


It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines.

Additional resources

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman, including Chapter 1 - Data Mining, Chapter 3 - Finding Similar Items, Chapter 12 - Large-Scale Machine Learning

Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, including Chapter 6: Scoring, term weighting and the vector space model; Chapter 13: Text classification and Naive Bayes

An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper

Tom Mitchell's Machine Learning courses at Carnegie Mellon

Alex Holehouse's notes for the Fall 2011 session of the Stanford Machine Learning course

CS 229 Machine Learning - Autumn 2014

Stanford Machine Learning on Coursera (the on-demand version)

Probabilistic Programming and Bayesian Methods for Hackers