404-Finder

Write a program to auto-detect broken links

Due: Tuesday, March 10
Points: 5 (Extra Credit)

(Note: I’m still in the process of writing the hints and requirements for this assignment, though you’re welcome to start drafting out ideas.) Link rot, in which hyperlinks stop pointing to working pages because the target has moved, been deleted, or the URL was mistyped, is a common problem in web publishing. In this assignment, you’ll write a practical tool that uses curl to quickly check which hyperlinks on a given web page are broken.

Deliverables

  • A folder named "404-finder" in your GitHub repo

  • A script named 'url-checker.sh'

    The url-checker.sh script takes one argument: the URL of a web page to gather hyperlinks from. The script then visits each of those URLs and retrieves its HTTP status code, e.g. 200 or 404.

    Usage:

          bash url-checker.sh http://www.example.com
    

    The output of url-checker.sh is a comma-delimited list containing two columns, the URL and its HTTP status code, sorted alphabetically by URL:

          http://en.wikipedia.org/,200
          http://www.example.com/broken,404
          http://www.example.com/hello,200
          https://www.facebook.com,200                  
    

    The url-checker.sh script should check each URL on a given page exactly once. If relative URLs are found on a page, url-checker.sh will have to resolve them to absolute URLs before visiting them.

  • Hints

    You will want to understand conditional branching, how to write a for loop, how to write a reusable script, how to use pup to parse HTML, and how to use curl to fetch just the response headers.
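
    For instance, a reusable script usually checks its arguments before doing any work. Here is a minimal sketch of that kind of conditional branching (the variable name page_url is just a placeholder):

          # quit with a usage message if no URL was supplied
          if [ $# -ne 1 ]; then
            echo "Usage: bash url-checker.sh <url>"
            exit 1
          fi
          page_url="$1"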

    Steps

    You should partition your url-checker.sh script into a few phases:

    1. Gather the URLs

    Use the pup tool to extract all of the href attributes from the given page.
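
    One possible sketch of this phase, assuming pup is installed and the page's URL is stored in a placeholder variable named page_url:

          # fetch the page quietly, follow redirects, and print each href value
          hrefs=$(curl -sL "$page_url" | pup 'a attr{href}')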

    2. Normalize the URLs

    Some of the URLs may be relative. For example, if you visit "http://www.whitehouse.gov", you might find an href that points to the relative URL "/pictures/index.html". Your script will need to translate this URL into its absolute form, i.e. "http://www.whitehouse.gov/pictures/index.html"
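
    One very simplified way to handle this, assuming the site's base URL is stored in a placeholder variable named base_url and that relative URLs begin with a slash (other relative forms need more work):

          # absolute URLs pass through; root-relative paths get the base URL prepended
          case "$href" in
            http://*|https://*) url="$href" ;;
            /*) url="${base_url}${href}" ;;
          esac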

    3. Filter the URLs

    Some URLs may be repeated on the page, so use sort and uniq to create a list of unique URLs.
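
    For example, if the normalized URLs are stored one per line in a file (the filename here is just an example), this also leaves them in the alphabetical order the output requires:

          sort urls.txt | uniq > unique-urls.txt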

    4. Visit the URLs

    Use curl and its options to retrieve just the HTTP response code for a given URL. Do not output the content of each visited link; the purpose of url-checker.sh is to report the HTTP status of each URL, not to save its content.
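
    As a rough sketch, curl's -o /dev/null option discards the response body, -s silences the progress meter, and -w '%{http_code}' prints only the status code. Looping over the unique URLs and printing a comma-delimited line for each might look something like this (the filename is an assumption carried over from the previous step):

          while read -r url; do
            # -s silences progress, -o /dev/null discards the body,
            # -w '%{http_code}' prints only the HTTP status code
            status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
            echo "$url,$status"
          done < unique-urls.txt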