Link rot, in which hyperlinks break because the pages they point to have moved or disappeared (or because the URLs were mistyped in the first place), is a common problem in web publishing. In this assignment, you’ll write a practical tool that uses curl to quickly check which hyperlinks on a given web page are broken.
The url-checker.sh script takes one argument: the URL of a web page to gather hyperlinks from. The script then visits each hyperlink it finds and retrieves its HTTP status code, e.g. 200 or 404.
Usage:
bash url-checker.sh http://www.example.com
The output of url-checker.sh is a comma-delimited list with two columns, the URL and its HTTP status code, sorted alphabetically by URL:
http://en.wikipedia.org/,200
http://www.example.com/broken,404
http://www.example.com/hello,200
https://www.facebook.com,200
The url-checker.sh script should check each URL on a given page exactly once. If relative URLs are found on the page, url-checker.sh must resolve them to absolute URLs before visiting them.
You will want to understand conditional branching, how to write a for loop, how to write a reusable script, how to use pup to parse HTML, and how to use curl to fetch just a page's headers.
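For example, a minimal sketch of the branching and argument handling involved might look like the following; the variable name page_url is my own choice, not something the assignment requires:

#!/bin/bash
# Sketch: exit early with a usage message if no URL argument was supplied.
if [ $# -lt 1 ]; then
  echo "Usage: bash url-checker.sh <url>" >&2
  exit 1
fi
page_url="$1"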
You should partition your url-checker.sh script into a few phases:
Use the pup tool to extract all of the href attributes from the given page.
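One way to do this, assuming the page URL has been stored in a variable named page_url (my own naming), is to pipe the page's HTML from curl into pup:

# Fetch the page quietly and print the href attribute of every <a> tag.
curl -s "$page_url" | pup 'a attr{href}'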
Some of the URLs may be relative. For example, if you visit "http://www.whitehouse.gov", you might find an href that points to the relative URL "/pictures/index.html". Your script will need to translate this URL into its absolute form, "http://www.whitehouse.gov/pictures/index.html".
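One rough way to sketch this step in Bash is a case statement over each extracted href. Here, href and base_url are assumed variable names, with base_url holding the page's scheme and host (e.g. http://www.whitehouse.gov); the final branch is a simplification that ignores the current page's directory:

case "$href" in
  http://*|https://*)
    url="$href"            # already absolute; use as-is
    ;;
  /*)
    url="$base_url$href"   # root-relative; prepend the scheme and host
    ;;
  *)
    url="$base_url/$href"  # other relative paths; simplified handling
    ;;
esac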
Some URLs may be repeated on the page, so use sort and uniq to create a list of unique URLs.
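For instance, assuming the resolved URLs have been collected into a hypothetical file named urls.txt:

# uniq only collapses adjacent duplicates, so sort the list first;
# this also puts the URLs in the alphabetical order the output requires.
sort urls.txt | uniq > unique-urls.txt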
Use curl and its various options to find the HTTP response code for a given URL. Do not output the content of each visited link; the purpose of url-checker.sh is to find the HTTP status of each URL, not to save its content.
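One common approach, though not the only one, is to combine curl's -s, -o, and -w options so the response body is discarded and only the status code is printed. The sketch below assumes the deduplicated URLs live in a hypothetical file named unique-urls.txt:

# For each unique URL, print "URL,status_code" on its own line.
while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url,$status"
done < unique-urls.txt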