The Celebrity (Tw)It List

Finding out who the most-followed users follow on Twitter.

Due: Tuesday, March 10
Points: 5 (Extra Credit)

Using the t command-line Twitter tool, find the most popular Twitter users, and then find who they follow. The result is, purportedly, an elite list of Twitter users who have the most influence over “celebrities”. This is a (hopefully) fun exercise in traversing networks with data, as we try to compile data by following links from one datapoint to the next.

Deliverables

  • A project directory named twitter-celeb-list

    The project folder will look like this:

      |-- compciv/
        |-- homework/
          |-- twitter-celeb-list/
            |-- t-bouncer.sh (get info on "celebrities" from Twitter API)
            |-- t-lister.sh (generate list of highly-followed users)
            |-- top-100.csv (the list of 100-most followed users and who follows them)
            |-- data-hold/
              |-- followings/ (cache the followings list of users)
    
  • t-bouncer.sh

    The t-bouncer.sh script should take in, as an argument, a CSV or list of CSV files of Twitter users, as produced by the t program. It then filters this listing of users like so:

    1. Filter the list of users for “celebrities” who actually use Twitter. This is how I’m defining such a “celebrity”:

      • Is verified - this is some assurance that this person has a brand/identity important enough for Twitter to vouch for.
      • Has more than 50,000 followers - Another small assurance of real-world popularity
      • Is following 1,000 or fewer users - this is to ignore accounts that are probably just used for PR/brand-management purposes, rather than the person tweeting themselves.
      • Has created at least 1,000 tweets - if they aren’t tweeting much, then maybe they haven’t bothered to nitpick over their own followings-list.

      In other words, the following users will be filtered out as not being “celebrities” (at least, in the personal-Twitter-sense):

      img

    2. For the users who are allowed through the filter, i.e. the “celebrities”, t-bouncer.sh will then collect a list of whom they follow, and save it to a file:

      data-hold/followings/the_username_in_lowercase.csv

    In other words, data-hold/followings/ will contain files, each one named after a “celebrity’s” Twitter username. Each of those files contains the list of people whom that “celebrity” follows.
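
    The four criteria above can be expressed as a single test. Here is a minimal sketch, assuming you have already pulled the four values out of a CSV row; the function name is_celebrity and its argument order are hypothetical, not part of the assignment:

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the assignment's required files.
# Usage: is_celebrity VERIFIED FOLLOWERS FOLLOWING TWEETS
# Returns success (exit status 0) only when all four criteria are met.
is_celebrity() {
  local verified=$1 followers=$2 following=$3 tweets=$4
  [[ $verified == "true" ]] &&    # is verified
  [[ $followers -gt 50000 ]] &&   # has more than 50,000 followers
  [[ $following -le 1000 ]] &&    # is following 1,000 or fewer users
  [[ $tweets -ge 1000 ]]          # has created at least 1,000 tweets
}
```

    You would call it as, for example, is_celebrity true 60000 500 2000 and check its exit status in an if-statement. Note the use of -gt, -le, and -ge: inside [[ ]], these compare numbers, whereas < and > compare strings.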

  • t-lister.sh

    The t-lister.sh script expects data-hold/followings to contain a bunch of CSV files containing lists of followed Twitter users. It reads through all of these CSV files and counts up the usernames found in all the files (e.g. using a combination of sort and uniq -c).

    It then filters that list of username counts to include only usernames that appear 5 or more times (i.e. followed by at least 5 different celebrity users).

    And then, it outputs a list that looks like this:

      thedude,walter
      thedude,donny
      thedude,bunny
      thedude,brandt
      jefflebowski,bunny
      jefflebowski,brandt
      jefflebowski,woo
    

    Each line in that output is a comma-delimited data row of two columns: the username of a followed Twitter user, and the username of a user who follows that user. In the above example, thedude is followed by walter, donny, bunny, and brandt.

    The reason to include multiple rows per followed user is because it’s interesting to see who the followers are. And also, because it’s dead simple to get just the list of unique followed users:

      bash t-lister.sh | cut -d ',' -f 1 | sort | uniq
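
    To make the counting-and-filtering step concrete, here is a rough sketch on simplified data. It assumes each file is named after the follower and holds one followed username per line (the real files are full CSVs produced by t, so you would first need to extract the username column). The function name popular_pairs and the .txt layout are hypothetical, and it counts with awk instead of sort and uniq -c:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print "followed,follower" rows for every user who is
# followed by at least MIN of the accounts whose lists live in DIR.
# Assumes DIR/<follower>.txt holds one followed username per line.
popular_pairs() {
  local dir=$1 min=$2 pairs file follower
  pairs=$(
    for file in "$dir"/*.txt; do
      follower=$(basename "$file" .txt)
      # append ",<follower>" to every followed username in this file
      sed "s/\$/,$follower/" "$file"
    done | tr '[:upper:]' '[:lower:]' | sort -u
  )
  # The first pass over the rows counts followers per followed user;
  # the second pass prints only the rows that meet the threshold.
  awk -F, -v min="$min" '
    NR == FNR { count[$1]++; next }
    count[$1] >= min
  ' <(printf '%s\n' "$pairs") <(printf '%s\n' "$pairs")
}
```

    Calling popular_pairs some-dir 5 would apply the five-follower cut-off described above.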
    
  • top-100.csv

    Create this file with the output produced by t-lister.sh. It will contain far more than 100 rows, but doing this:

      cat top-100.csv | cut -d ',' -f 1 | sort | uniq | wc -l
    

    – should result in 100

    Be aware that the names in this list do not themselves have to fit our definition of “celebrity”. For example, your top-100.csv will most likely contain user barackobama.

  • Hints

    Example usage

    This assignment's description is a little verbose and it may be hard to create a mental picture of the effect of t-bouncer.sh. So, first, try to read the entirety of the Hints section. And if you're confused, then here's a more visual depiction of how things should work:

    Getting a "seed"

    Let's say you're a big Jonah Hill fan. You do a Google search for his Twitter account, which is at @JonahHill, and you notice that he fits our definition of a Twitter celebrity (verified, 50,000+ followers, 1,000 or fewer users followed, at least 1,000 tweets made).

    So you manually download his followings list, as CSV, into your currently empty data-hold/followings/ folder. You can think of jonahhill.csv as being the seed:

    t followings --csv jonahhill > data-hold/followings/jonahhill.csv
    

    OK, now data-hold/followings/ contains a single file:

      data-hold/followings/jonahhill.csv
    

    First run of t-bouncer.sh

    Now, if you've designed t-bouncer.sh properly, you can execute it like this:

    bash t-bouncer.sh data-hold/followings/*.csv
    

    As it turns out, @JonahHill, at the time of writing, follows exactly one user: @BlondieOfficial.

    Now, if @BlondieOfficial didn't meet our definition of a "celebrity", the t-bouncer.sh script would do nothing – because it begins by first filtering each CSV file for "celebrities".

    However, it just so happens that @BlondieOfficial meets our definition of a Twitter "celebrity", and so t-bouncer.sh will fetch her list of followings and save it into data-hold/followings/blondieofficial.csv.

    So now data-hold/followings contains:

      data-hold/followings/jonahhill.csv
      data-hold/followings/blondieofficial.csv

    Second run of t-bouncer.sh

    You can run t-bouncer.sh the same way as the first:

    bash t-bouncer.sh data-hold/followings/*.csv
    

    Because you've told t-bouncer.sh to go through every CSV in data-hold/followings, it will again filter jonahhill.csv and see that it contains a "celebrity" listing, @BlondieOfficial.

    But if you've designed t-bouncer.sh to not fetch already-fetched files, i.e. data-hold/followings/blondieofficial.csv, then it should skip fetching that, and then move on to filtering data-hold/followings/blondieofficial.csv for celebrities.

    As of time of writing, @BlondieOfficial follows 218 users, 80 of whom are celebrities. So t-bouncer.sh will end up downloading 79 new CSV files (as it turns out, @BlondieOfficial also follows @JonahHill, and we already have data-hold/followings/jonahhill.csv).

    So here's what your data-hold/followings/ directory will look like after each step:

    Seed:
    jonahhill.csv

    1st run:
    jonahhill.csv
    blondieofficial.csv

    2nd run:
    jonahhill.csv
    blondieofficial.csv
    aaronpaul_8.csv
    abcnetwork.csv
    actuallynph.csv
    aliciakeys.csv
    beatsmusic.csv
    bettemidler.csv
    bloomingdales.csv
    cher.csv
    chrislilley.csv
    coachella.csv
    conanobrien.csv
    cyndilauper.csv
    davejmatthews.csv
    ditavonteese.csv
    dixiechicks.csv
    ellenpage.csv
    fabnewyork.csv
    genesimmons.csv
    goldiehawn.csv
    gothamist.csv
    gracehelbig.csv
    grantland33.csv
    greatdismal.csv
    guardianmusic.csv
    haimtheband.csv
    harpersbazaarus.csv
    hitrecord.csv
    howardstern.csv
    huffpostgay.csv
    i_d.csv
    iamrashidajones.csv
    itunesmusic.csv
    janemarielynch.csv
    jimmybuffett.csv
    jimmykimmel.csv
    johnstamos.csv
    jpgaultier.csv
    jtimberlake.csv
    karllagerfeld.csv
    katyperry.csv
    kevinspacey.csv
    kimletgordon.csv
    lenadunham.csv
    lilyallen.csv
    lordemusic.csv
    louisvuitton.csv
    louisvuitton_uk.csv
    mileycyrus.csv
    mrssosbourne.csv
    newyorker.csv
    nylonmag.csv
    nymag.csv
    nytimes.csv
    officialrodarte.csv
    oprah.csv
    ournameisfun.csv
    paulmccartney.csv
    pearljam.csv
    pink.csv
    rollingstone.csv
    russellcrowe.csv
    ryanseacrest.csv
    spinmagazine.csv
    stevemartintogo.csv
    susansarandon.csv
    thewho.csv
    time.csv
    u2.csv
    unrightswire.csv
    vanityfair.csv
    vicenews.csv
    vmagazine.csv
    voguemagazine.csv
    wmag.csv
    worldmcqueen.csv
    wsj.csv
    wwd.csv
    youtube.csv
    zooeydeschanel.csv

    As you can see, the contents of data-hold/followings/ are likely to grow exponentially, which makes sense, if we assume that interesting/elitist people will themselves follow interesting/elitist people.

    However, in the above example, you'll need to run t-bouncer.sh a third time, as you need a minimum of 100 CSV files (i.e. the followings of 100 "celebrities") in data-hold/followings/.

    Hints

    Using iconv

    This is important.

    I highly suggest using csvfix for parsing the CSV files. That said, csvfix has a near-fatal flaw: it doesn't like UTF-8 encoded files. If you really care about the byte-level details of what UTF-8 is, and how it makes programmers' lives a frequent hell, check out Joel Spolsky's article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    For our purposes, it means that things like emoticons-as-text can cause errors:

    img

    The easy fix is to use the iconv program to delete all non-ASCII characters (i.e. anything beyond simple American-English text), which is fine for our purposes because Twitter screen names can only contain ASCII characters anyway. Here's an example usage:

      t followings --csv stanford | 
          iconv -c -f utf-8 -t ascii > data-hold/followings/stanford.csv
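
    To see what that iconv call does without spending an API request, you can feed it a line of text by hand. This is only a sketch: the wrapper name strip_non_ascii is hypothetical, and the flags are the same ones used above (-c tells iconv to drop characters it cannot convert):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the same iconv call used above:
# reads UTF-8 on stdin and writes ASCII on stdout, silently
# dropping any character that has no ASCII equivalent.
strip_non_ascii() {
  iconv -c -f utf-8 -t ascii
}
```

    For example, printf 'caf\xc3\xa9 ok\n' | strip_non_ascii sends the UTF-8 bytes for an accented e through the filter; the accented letter is dropped and the plain ASCII text passes through unchanged.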
    

    Where do I get a list of celebrities?

    What might be slightly confusing is: well, where do we get a list of "celebrities" to begin with?

    The answer: From any list of users that the t program generates.

    For example, you could start with the list of users followed by @Stanford:

        t followings --csv stanford |
          iconv -c -f utf-8 -t ascii > data-hold/followings/stanford.csv
    

    I've saved a cached list of this CSV here, which you can curl at your convenience. At the time of my writing this, @Stanford follows 931 users, and at least a few of them meet my requirements for being a "celebrity".

    We can use csvfix to quickly filter this list (see its documentation for all of its commands here). For example, to find all verified users (i.e., users who have true in the Verified column) with more than 50,000 followers:

        cat data-hold/followings/stanford.csv | 
          csvfix find -f 11 -s 'true' |
          csvfix find -if '$8 > 50000'
    

    I'll let you write the rest of the filters needed (making more calls to csvfix find should be sufficient) to meet my stated requirements for a "celebrity": verified, more than 50,000 followers, 1,000 or fewer followed users, and at least 1,000 tweets. If you run your filtering command on the sample list here, you should get a list of 50 users, including AmbassadorRice, Atul_Gawande, BillGates, CarlyFiorina.

    Your t-bouncer.sh script should run like this:

        bash t-bouncer.sh data-hold/followings/stanford.csv
    

    And for every "celebrity" Twitter user in that CSV, t-bouncer.sh uses t to download and save a new list of followings. So if the stanford followings CSV contains 50 eligible "celebrities", t-bouncer.sh should have created 50 new CSV files in data-hold/followings/.

    How do I get enough usernames to fill out top-100.csv?

    No matter whether data-hold/followings/ contains one CSV or 1,000 CSVs, t-lister.sh should do its job of collating and counting up the "cool" users. However, depending on how many followings-lists you've collected, t-lister.sh may not return enough followed users. Remember, the criterion for inclusion in top-100.csv is that a user is followed by at least five different celebrities; in other words, an eligible username appears in five or more of the CSVs inside data-hold/followings/.

    So at a minimum, you need to have collected the followings-list for at least 100 "celebrities". If all of these celebrities have 5-or-more followed-users in common, then your t-lister.sh script will have enough unique user names to fill out top-100.csv

    Remember that top-100.csv may contain hundreds/thousands of rows, since it contains the pairs of usernames and the celebrities who follow them. But running:

        cat top-100.csv | cut -d ',' -f 1 | sort | uniq | wc -l
    

    – should yield a result of 100

    How do I get more and more lists of "celebrities"?

    Let's say the @Stanford list of followed users doesn't generate enough users for top-100.csv. Where do you go next for more eligible celebrities?

    Well, why not gather new celebrities from the list of users contained within all the CSVs in data-hold/followings/? After all, the users that @Stanford follows must themselves follow a few eligible "celebrities".

    So basically, you can keep doing this:

        bash t-lister.sh | cut -d ',' -f 1 | sort | uniq | wc -l
        # is the output 100? If not, then:
        bash t-bouncer.sh data-hold/followings/*.csv
        bash t-lister.sh | cut -d ',' -f 1 | sort | uniq | wc -l
        # is the output 100? If not, then:
        bash t-bouncer.sh data-hold/followings/*.csv
        bash t-lister.sh | cut -d ',' -f 1 | sort | uniq | wc -l
        # is the output 100? If so, then:
        bash t-lister.sh > top-100.csv
        git add --all
        git commit -m "I'm all done"
        git push
    

    If you really can't be bothered to keep repeating that sequence of commands…remember the while construct?

    user_count=0
    while [[ $user_count -lt 100 ]]; do
      echo "Elite user count at $user_count"
      bash t-bouncer.sh data-hold/followings/*.csv
      user_count=$(bash t-lister.sh | cut -d ',' -f 1 | sort | uniq | wc -l)
    done
    

    Warning: if you start out with a bad "seed", i.e. subsequent runs of t-bouncer.sh don't produce any new "celebrities", from which you can continue to branch out…then this while loop will run forever. You can prevent that by creating another condition; for example, the following code uses another variable to count the total number of times that the loop has executed, quitting after 10 times (remember that it's possible for t-bouncer.sh to retrieve data for an exponentially-increasing number of users with each execution):

    loop_count=0
    user_count=0
    while [[ $user_count -lt 100 && $loop_count -lt 10 ]]; do
      echo "Elite user count at $user_count (Loop has run $loop_count times)"
      bash t-bouncer.sh data-hold/followings/*.csv
      user_count=$(bash t-lister.sh | cut -d ',' -f 1 | sort | uniq | wc -l)
      loop_count=$((loop_count+1))
    done
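
    Another way to guard against a bad seed, offered here only as a sketch and not as the required approach, is to stop as soon as a pass adds no new CSV files. The names count_csvs and run_until_stalled are hypothetical, and the step to run (for example, a small function that calls t-bouncer.sh) is passed in by name so that the loop can be tried out with a stand-in:

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the assignment's required files.

# Print the number of .csv files directly inside directory $1.
count_csvs() {
  find "$1" -maxdepth 1 -name '*.csv' | wc -l | tr -d ' '
}

# Usage: run_until_stalled DIR STEP MAX_LOOPS
# Runs STEP (a command or function name) with DIR as its argument until
# a pass adds no new CSV files, or until MAX_LOOPS passes have run.
# Prints the number of passes that were run.
run_until_stalled() {
  local dir=$1 step=$2 max_loops=$3 loops=0 before after
  while [[ $loops -lt $max_loops ]]; do
    before=$(count_csvs "$dir")
    "$step" "$dir"
    after=$(count_csvs "$dir")
    loops=$((loops + 1))
    # no new files means there is nothing left to branch out from
    [[ $after -eq $before ]] && break
  done
  echo "$loops"
}
```

    This complements the loop counter above rather than replacing it: the counter caps the total work, while the stall check quits early when nothing new is being found.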
    

    Watch out for rate-limits

    Remember that Twitter's API has a rate limit. For our purposes, this means that you can get the followings list for no more than 15 "celebrities" per 15 minutes.

    So t-bouncer.sh must include at least two features:

    1. Inside t-bouncer.sh, you'll likely have some kind of loop, because for each eligible username, you'll be calling t followings. So in that loop, you should be running the sleep command to sleep for at least 60 seconds.
    2. You should not re-collect data for users whose CSV files you already have in data-hold/followings/. In other words, you need an if-statement.

    So your t-bouncer.sh script should contain logic that looks like this:

    # assume that this part of the code is within a loop in which $username 
    # contains a Twitter username to possibly fetch the followings-data for
    
        lowercase_name=$(echo "$username" | tr '[:upper:]' '[:lower:]')
        filename="data-hold/followings/$lowercase_name.csv"
        if [[ -s "$filename" ]]; then
          echo "Already have followings-list for $lowercase_name"
        else
          echo "Getting followings for $lowercase_name"
          # run your code to execute the appropriate t program and save the data
          # ...
          echo "Now sleeping for 60 seconds"
          sleep 60
        fi
    # ... and continue with the rest of the t-bouncer.sh script
    
    

    Note: There are other, more efficient and cleverer ways to design this loop. In fact, it's probably better to include a limit of new files to fetch within t-bouncer.sh, because t-bouncer.sh could conceivably run for what will seem like forever.

    Here's an alternative, which incorporates an if-statement within an if-statement:

    # assume that this part of `t-bouncer.sh` is within a loop in which 
    # the username variable is filled in
    
    # in addition, assume that users_fetched variable has been set to 0
    # at the beginning of the `t-bouncer.sh`
    
      if [[ $users_fetched -lt 15 ]]; then    
        lowercase_name=$(echo "$username" | tr '[:upper:]' '[:lower:]')
        filename="data-hold/followings/$lowercase_name.csv"
        if [[ -s "$filename" ]]; then
          echo "Already have followings-list for $lowercase_name"
        else
          echo "Getting followings for $lowercase_name; $users_fetched users fetched so far"
          # run your code to execute the appropriate t program and save the data
          # ...
          # instead of sleeping for 60 seconds, just move on to the
          # next username...let whatever calls `t-bouncer.sh` do the 
          # necessary sleeping
          users_fetched=$((users_fetched + 1))
        fi
      fi
    
    ## continue on with the rest of `t-bouncer.sh`
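
    If you want to try out the skip-if-already-fetched logic without using up API calls, one option is to wrap it in a function that takes the fetching command as an argument. This is only a sketch under that assumption: the name fetch_if_missing, its argument order, and the idea of passing in the fetch command are all hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the assignment's required files.
# Usage: fetch_if_missing USERNAME DIR FETCH_CMD [PAUSE_SECONDS]
# FETCH_CMD is any command or function that takes a lowercase username
# and prints that user's followings list to stdout.
fetch_if_missing() {
  local username=$1 dir=$2 fetch_cmd=$3 pause=${4:-60}
  local lowercase_name filename
  lowercase_name=$(echo "$username" | tr '[:upper:]' '[:lower:]')
  filename="$dir/$lowercase_name.csv"
  # -s is true only if the file exists AND is non-empty, so an empty
  # file left behind by a failed fetch will be fetched again
  if [[ -s "$filename" ]]; then
    echo "Already have followings-list for $lowercase_name"
  else
    echo "Getting followings for $lowercase_name"
    "$fetch_cmd" "$lowercase_name" > "$filename"
    sleep "$pause"
  fi
}
```

    For a dry run, pass a stand-in function that just echoes a line, along with a pause of 0. In the real script, the fetch command would be whatever you've written to call t and clean up its output.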
    

    The effect of your starting point

    If you decide to start off by pointing t-bouncer.sh to a different list than @Stanford's followings, say, the 137 users followed by @SarahPalinUSA, your top-100.csv may look very different. In other words, the make-up of top-100.csv depends very much on which users appear on your "celebrities" followed-lists and on the order in which you collect them. Keep in mind that the cut-off that t-lister.sh applies (100 users, each followed by 5 or more of the celebrities you've collected) is arbitrary.