Link rot, in which hyperlinks break because the pages they point to have moved or disappeared (or because the URLs were mistyped in the first place), is a common problem in web publishing. In this assignment, you’ll write a practical tool that uses curl to quickly check which hyperlinks on a given web page are broken.
The url-checker.sh script takes one argument: the URL of a web page to gather hyperlinks from. The script then visits each hyperlink it finds and retrieves its HTTP status code, e.g. 200 or 404.
Usage:
bash url-checker.sh http://www.example.com
The output of url-checker.sh is a comma-delimited list with two columns, the URL and its HTTP status code, sorted alphabetically by URL:
http://en.wikipedia.org/,200
http://www.example.com/broken,404
http://www.example.com/hello,200
https://www.facebook.com,200
The url-checker.sh script should check each URL on a given page exactly once. If relative URLs are found on the page, url-checker.sh must resolve them to absolute URLs before visiting them.
You will want to understand conditional branching, how to write a for loop, how to write a reusable script, how to use pup to parse HTML, and how to use curl to fetch just a page's headers.
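For example, a minimal sketch of the branching and argument handling involved might look like the following; the variable name page_url is my own choice, not something the assignment requires:

#!/bin/bash
# Sketch: exit early with a usage message if no URL argument was supplied.
if [ $# -lt 1 ]; then
  echo "Usage: bash url-checker.sh <url>" >&2
  exit 1
fi
page_url="$1"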
You should partition your url-checker.sh script into a few phases:
Use the pup tool to extract all of the href attributes from the given page.
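One way to do this, assuming the page URL has been stored in a variable named page_url (my own naming), is to pipe the page's HTML from curl into pup:

# Fetch the page quietly and print the href attribute of every <a> tag.
curl -s "$page_url" | pup 'a attr{href}'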
Some of the URLs may be relative. For example, if you visit "http://www.whitehouse.gov", you might find an href that points to the relative URL "/pictures/index.html". Your script will need to translate this URL into its absolute form, "http://www.whitehouse.gov/pictures/index.html".
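One rough way to sketch this step in Bash is a case statement over each extracted href. Here, href and base_url are assumed variable names, with base_url holding the page's scheme and host (e.g. http://www.whitehouse.gov); the final branch is a simplification that ignores the current page's directory:

case "$href" in
  http://*|https://*)
    url="$href"            # already absolute; use as-is
    ;;
  /*)
    url="$base_url$href"   # root-relative; prepend the scheme and host
    ;;
  *)
    url="$base_url/$href"  # other relative paths; simplified handling
    ;;
esac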
Some URLs may be repeated on the page, so use sort and uniq to create a list of unique URLs.
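For instance, assuming the resolved URLs have been collected into a hypothetical file named urls.txt:

# uniq only collapses adjacent duplicates, so sort the list first;
# this also puts the URLs in the alphabetical order the output requires.
sort urls.txt | uniq > unique-urls.txt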
Use curl and its various options to find the HTTP response code for a given URL. Do not output the content of each visited link; the purpose of url-checker.sh is to find the HTTP status of each URL, not to save its content.
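One common approach, though not the only one, is to combine curl's -s, -o, and -w options so the response body is discarded and only the status code is printed. The sketch below assumes the deduplicated URLs live in a hypothetical file named unique-urls.txt:

# For each unique URL, print "URL,status_code" on its own line.
while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url,$status"
done < unique-urls.txt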