Collecting data about your Instagram feed

An exercise in using the Instagram API to get your own media collection.

This page is meant mostly as another example of interacting with the Instagram API. For some of the HTML tutorials, I'll be using the data from government Instagram accounts. But maybe you want to use your own data? Then follow the steps in this guide.

You should start with the "Montage the world with Instagram and Google" guide.

Note: This guide is under a bit of construction

Gathering your Instagram photos

We will be accessing the /users/(userid)/media/recent endpoint, which is documented in Instagram's API reference.

The parameters we care about are access_token, count, and max_id.

You'll need your own ACCESS_TOKEN, and you should use your own USERNAME instead of mypubliclands (which belongs to the U.S. Bureau of Land Management):

# paste in your own token if you didn't initialize it with .bashrc
ACCESS_TOKEN="$INSTAGRAM_TOKEN"
BASE_URL='https://api.instagram.com/v1'
USERNAME='mypubliclands'
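
If you initialized the token via .bashrc, it's worth confirming that the variable actually contains something before going further:

[[ -n "$ACCESS_TOKEN" ]] && echo 'Token is set' || echo 'Whoops, empty token'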

Deriving the Instagram user ID from the username

You might know your Instagram username, but you may not know your Instagram user_id. So we need to use the users/search endpoint:

curl "$BASE_URL/users/search?access_token=$ACCESS_TOKEN&q=$USERNAME" > profile.json

Inside profile.json, the response for mypubliclands looks like this:

{
  "meta": {
    "code": 200
  },
  "data": [
    {
      "username": "mypubliclands",
      "bio": "The Bureau of Land Management manages more than 245 million acres of public land, with some of the most breathtaking landscapes anywhere on earth.",
      "website": "http://www.blm.gov",
      "profile_picture": "https://instagramimages-a.akamaihd.net/profiles/profile_225371825_75sq_1348864741.jpg",
      "full_name": "Bureau of Land Management",
      "id": "225371825"
    }
  ]
}
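
As a quick sanity check, the meta.code field in the response tells you whether the request succeeded:

cat profile.json | jq '.meta .code'
# 200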

Your response may have more than one result in the data array, so eyeball the appropriate id value. If you happen to be the most popular account in the array of results, you can grab the id programmatically by assuming that you are the first in the results:

cat profile.json | jq -r '.data[0] .id'
# 225371825

And now we can set a USER_ID variable for the next step:

USER_ID='225371825'

Paginating the Instagram recent media results

Let's set some variables:

COUNT="50"
MEDIA_ENDPOINT="$BASE_URL/users/$USER_ID/media/recent?access_token=$ACCESS_TOKEN&count=$COUNT"

To get the data from your first 50 photos:

curl -o temp.json "$MEDIA_ENDPOINT"

This endpoint returns images in reverse chronological order, i.e. most recent first. If we want anything older than the 50 most recent images, we have to use the max_id parameter, which the documentation describes as:

MAX_ID Return media earlier than this max_id.

So, among the images that we've just curled into temp.json, we need the id of the oldest image. And if the images in temp.json are sorted starting from most recent, that means we want the last image in temp.json.

Using jq:

cat temp.json | jq --raw-output '.data | reverse[0] .id'
# 237497413860994153_181309234

So let's pass it in to the API call as the value for max_id:

nid=$(cat temp.json | jq --raw-output '.data | reverse[0] .id')
curl "$MEDIA_ENDPOINT&max_id=$nid" -o "$nid.json"

Whatever exists in $nid.json should be a new batch of photos (assuming you have more than 50 photos).
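
One way to convince yourself that it worked: the created_time field is a Unix timestamp, so the newest photo in the new batch should be older (i.e. a smaller number) than the oldest photo in temp.json:

cat "$nid.json" | jq -r '.data[0] .created_time'
cat temp.json | jq -r '.data | reverse[0] .created_time'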

So basically, the routine for collecting all the photos is:

# get the first batch of photos
curl "$MEDIA_ENDPOINT" -o first_batch.json

# get the 2nd batch, based on the `id` of the 
# *last* photo in the current batch
nid=$(cat first_batch.json | jq --raw-output '.data | reverse[0] .id')
curl "$MEDIA_ENDPOINT&max_id=$nid" -o "$nid.json"

And at this point, you just repeat the following code:

# get the nth batch...and repeat
nid=$(cat $nid.json | jq --raw-output '.data | reverse[0] .id')
echo $nid
curl -s "$MEDIA_ENDPOINT&max_id=$nid" -o "$nid.json"

– until you get an empty batch, i.e. $nid is equal to null
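
You can check how many photos a given batch contains with jq; an exhausted batch has an empty data array, which is why reverse[0] .id comes back as null:

cat temp.json | jq '.data | length'
# e.g. 50, for a full batch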

Optional looping just because

If you have a lot of photos, then copying-and-pasting will feel tedious. So now is a chance to practice your while loops. You don't have to do this, but you should aspire to learn it, as it's how you end up with a hands-free, automated process for re-downloading your Instagram data at will.

There are many ways to structure the loop; I've chosen a straightforward, if inelegant way:

  1. The variable nid is set to an empty string, i.e. ''
  2. The while loop initiates and keeps going as long as $nid is not null. Note that null is not the same as the empty string, '', which is how we enter the loop in the first place
  3. The output of the curl call is stored in the response variable
  4. Then we use jq to parse $response, by looking at the .data array, reversing it, and then grabbing the .id of the 0th element, which we assign to the nid variable
  5. If there is at least one item in the data array, the $nid is not null, and we save the JSON contained in $response into $nid.json (that way, each response is in a different JSON file)
  6. We sleep for 2 seconds, then return to the top of the loop. This process keeps going until $nid is null.

nid=''
while [[ $nid != null ]]; do
  # if the max_id is empty, you'll get the most recent photos by default
  response=$(curl -s "$MEDIA_ENDPOINT&max_id=$nid")
  # quoting $response preserves the JSON's whitespace and avoids globbing
  nid=$(echo "$response" | jq --raw-output '.data | reverse[0] .id')
  echo "nid: $nid"
  if [[ $nid != null ]]; then
    # name each response with the id of its oldest photo
    echo "$response" > "$nid.json"
  fi
  sleep 2
done
# For mypubliclands, the echo output will look like this:
#nid: 886567966460382181_225371825
#nid: 853735645642525166_225371825
#nid: 823860786682955993_225371825
#nid: 804346466643249380_225371825
#nid: 776955719837321679_225371825
#nid: 742745337740692263_225371825
#nid: 704359738076072008_225371825
#nid: 673222394963069579_225371825
#nid: 648089023618518012_225371825
#nid: 584804608801278362_225371825
#nid: 568969379649981259_225371825
#nid: 526931139443330000_225371825
#nid: 502980035944941622_225371825
#nid: 470412227080128899_225371825
#nid: 415184442418301822_225371825
#nid: 390018053906925925_225371825
#nid: 363732785461758349_225371825
#nid: 342023866460965417_225371825
#nid: 318780852670335378_225371825
#nid: 302846550820555048_225371825
#nid: 291165315748628823_225371825
#nid: 290833966017841191_225371825
#nid: null

Examining the data

Once you're done, you should have a few .json files (in the examples below, I've moved mine into a mypubliclands/ directory). You can then use jq to list the permalink of every one of your Instagram photos, or each photo's like count:

cat mypubliclands/*.json | jq -r '.data[] .link'
cat mypubliclands/*.json | jq -r '.data[] .likes .count'
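
And to tally how many photos you've collected overall, one approach is to slurp the files and sum the lengths of their data arrays:

cat mypubliclands/*.json | jq -s 'map(.data | length) | add'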

Finding your top 25 photos

(Under construction, but basically: use jq.)
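
In the meantime, here's a minimal sketch, assuming your batch files are in a mypubliclands/ directory as above. It reuses the sort_by pattern from the proof-of-concept section below: combine every batch into one array, sort by like count, and keep the 25 most-liked photos in a top25.json file:

cat mypubliclands/*.json | \
  jq '.data[]' | \
  jq -s 'sort_by(.likes .count) | reverse[0:25]' > top25.json

# list their permalinks, most-liked first
cat top25.json | jq -r '.[] .link'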

Looping through our scenic government Instagrams

The following is just a proof-of-concept. You don't need to execute it yourself, but it's a demonstration of taking the previous logic and throwing it in a loop, i.e., having a loop within a loop, so that we can save the data for multiple U.S. government agencies' Instagrams:

Collecting all of their images and profile data

This loop does what we did previously, though I keep things organized by creating a rawjson directory and then a subdirectory inside rawjson for each agency.

ACCESS_TOKEN="$INSTAGRAM_TOKEN"
BASE_URL='https://api.instagram.com/v1'
COUNT='50'

for username in glaciernps nasa usinterior mypubliclands USFWS smithsonian; do 
  echo Fetching $username
  sleep 1
  mkdir -p ./rawjson/$username
  
  curl -o ./rawjson/$username/profile.json \
    -s "$BASE_URL/users/search?access_token=$ACCESS_TOKEN&q=$username" 
  # Get the user_id from the username
  user_id=$(cat ./rawjson/$username/profile.json | jq -r '.data[0] .id')
  echo "Found user_id of $user_id for $username"

  MEDIA_ENDPOINT="$BASE_URL/users/$user_id/media/recent?access_token=$ACCESS_TOKEN&count=$COUNT"

  # begin the loop to paginate all the photo data
  nid=''
  while [[ $nid != null ]]; do
    # if the max_id is empty, you'll get the most recent photos by default
    response=$(curl -s "$MEDIA_ENDPOINT&max_id=$nid")
    nid=$(echo "$response" | jq --raw-output '.data | reverse[0] .id')
    echo "$username,$user_id   nid: $nid"
    if [[ $nid != null ]]; then
      # name each response with the id of its oldest photo
      echo "$response" > "./rawjson/$username/$nid.json"
    fi
    fi
    sleep 1
  done
done

Now get all of the longitude/latitude


mkdir -p "geocoded/images"
# make the top 25 images
datafile="geocoded/data.json"

cat ./rawjson/*/*.json | \
  jq '.data[] | select(.location .latitude != null)' | \
  jq -s '.' > $datafile
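
To spot-check what we've extracted, we can print each photo's id and coordinates from the file we just made:

cat $datafile | jq -r '.[] | [.id, .location .latitude, .location .longitude] | @csv'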

while read -r img; do 
  id=$(echo $img | cut  -d ',' -f 1)
  url=$(echo $img | cut  -d ',' -f 2)
  
  echo $url
  # Download the image from instagram
  curl -s $url -o "geocoded/images/$id.jpg"
  sleep 1
done < <(cat $datafile | \
    jq -r '.[] | [.id, .images .standard_resolution .url] | @csv' | tr -d '"' )

Top 25

for username in glaciernps nasa usinterior mypubliclands USFWS smithsonian; do 
  mkdir -p "top25/$username/images"
  # make the top 25 images
  datafile="top25/$username/data.json"
  cat ./rawjson/$username/*.json | \
    jq '.data[]' | \
    jq -s '.' | \
    jq 'sort_by(.likes .count) | reverse[0:25]' > $datafile
  
  echo "Downloading top images for $username"
  sleep 3
  while read -r img; do 
    id=$(echo $img | cut  -d ',' -f 1)
    url=$(echo $img | cut  -d ',' -f 2)
    
    echo $url
    curl -s $url -o "top25/$username/images/$id.jpg"
    sleep 1
  done < <(cat $datafile | \
    jq -r '.[] | [.id, .images .standard_resolution .url] | @csv' | tr -d '"' )
done

Of course we can montage them all like this:

montage top25/*/images/*.jpg insta.jpg

Making a webpage

Now we have all the data we need to make a custom page of our Instagram photos. Let's make a simple page that consists of just a grid of our photos.

And while we're prototyping, instead of generating a page of hundreds of photos, let's just start with the top 24 photos by Like count…we can sort/extract those photos with jq and make a top.json to work from:

First, make a file that is an array of all of the images (this extracts the .data arrays from each file and combines them into one array stored inside all.json):

# jq '.data[]' emits a stream of separate objects; the --slurp/-s
# flag is what reads that stream back in as one big JSON array
cat *.json | jq '.data[]' | jq -s '.' > all.json

# then sort by like count and keep the 24 most-liked photos
cat all.json | jq 'sort_by(.likes .count) | reverse[0:24]' > top.json

Basic iteration

cat > basic.html <<EOF
  <html>
  <body>
  <style>
  .images{ width: 900px; margin: auto auto; }
  .image{ float: left; width: 33% }
  </style>
  <div class="images">
EOF

for url in $(cat top.json | jq -r '.[] .images .low_resolution .url'); do 
  cat >> basic.html <<EOF
  <div class="image">
    <img src="$url">
  </div>
EOF
done


cat >> basic.html <<EOF
  </div> <!-- end of .images -->
  </body>
  </html>
EOF
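
Open basic.html in your browser to see the (admittedly spartan) grid of photos.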

Let's add some data

First, we have to reformat the JSON:

cat top.json | jq '.[] | { link: .link,
    image_url: .images .standard_resolution .url, 
    thumb_url: .images .low_resolution .url, 
    caption: .caption .text, 
    created_time: .created_time,
    lat: .location .latitude, 
    lng: .location .longitude } | @json'
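
The @json filter at the end re-serializes each object as a single-line JSON string, which is what lets the script below read one record at a time with read.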

The script

# better.sh
cat > better.html <<EOF
  <html>
  <head>
    <link rel="stylesheet" href="http://stash.compciv.org/assets/css/bootstrap.min.css">
  </head>
EOF
# insert jquery
# insert isotopejs
#
cat >> better.html<<EOF
  <body>
  <div class="container">
EOF

while read -r j; do
  # decode the JSON-encoded string back into a JSON object
  obj=$(echo "$j" | jq -r '.')
  # use -r so the values don't come wrapped in quote marks
  img_url=$(echo "$obj" | jq -r '.image_url')
  caption=$(echo "$obj" | jq -r '.caption')

  echo "$img_url"


  cat >> better.html <<EOF
  <section class="row">
     <div class="col-sm-4"> 
       <div class="image">
         <img src="$img_url">
       </div>
     </div>
      <div class="col-sm-4"> 
       <div class="caption">
         $caption
       </div>
     </div>
  </section>
EOF
done < <(cat top.json | jq '.[] | { link: .link,
    image_url: .images .standard_resolution .url, 
    thumb_url: .images .low_resolution .url, 
    caption: .caption .text, 
    created_time: .created_time,
    lat: .location .latitude, 
    lng: .location .longitude } | @json')


cat >> better.html <<EOF
  </div> <!-- end of .container -->
  </body>
  </html>
EOF
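
And that's it. Open better.html in your browser to see each photo alongside its caption.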