Creating reusable shell scripts

How to design and package your code so that it can be re-used in future scenarios.

This is an extension to a previous tutorial, "Creating Basic Shell Scripts".

Shakespearean word counter

Let's move on to another more complicated toy example: we'll write a shell script named top-shake-words.sh that will have this usage:

Given a single argument – a reference to a Shakespearean play, e.g. lear, – the top-shake-words.sh script returns the top 10 words by frequency of occurrence (regardless of capitalization) in that play:

# given a reference to a play, like 'lear'
bash top-shake-words.sh lear

    910 the
    737 and
    576 to
    476 of
    463 you
    458 my
    363 that
    303 in
    282 not
    282 king

Then design top-shake-words.sh so that it can take a second argument: list only the words that have a certain minimum number of letters. For example, to find the top 10 words in King Lear, by frequency, that have at least 7 letters:

bash top-shake-words.sh lear 7

Result:

    174 gloucester
     82 goneril
     74 cornwall
     62 cordelia
     56 gentleman
     34 nothing
     30 daughter
     28 daughters
     25 brother
     23 against

Getting the data

Execute these commands to download and unzip a folder of Shakespearean plays (rendered as plaintext files):

curl -so shakespeare.zip \
  http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip

This should create a new subdirectory named shakespeare-plays-flat-text. Change into that directory and count up all the lines in all the text files:

cd shakespeare-plays-flat-text
cat *.txt | wc -l

(You should end up with 240,241 lines)

Counting up the words in a play

Given a stream of text, how do we break it up into individual words and then count how often each word occurs?

Try to think through the steps and look up the individual commands on your own. The answer is below, using lear.txt and 1 as hard-coded values for the play and the minimum word length:

cat lear.txt | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE '[[:alpha:]]{1,}' | \
  sort | uniq -c
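The sort in front of uniq -c is doing real work: uniq only collapses duplicate lines that sit next to each other. A minimal sketch on a made-up three-word input (not taken from the play) shows the difference:

```shell
# uniq -c only merges *adjacent* duplicates, so with unsorted input
# the two "the" lines are counted separately (three output lines)
printf 'the\nking\nthe\n' | uniq -c

# sorting first puts identical words next to each other,
# so each distinct word gets a single count (two output lines)
printf 'the\nking\nthe\n' | sort | uniq -c
```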

To get just the top 10 results, sorted in reverse order, we add just two more filters:

cat lear.txt | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE '[[:alpha:]]{1,}' | \
  sort | uniq -c | \
  sort -rn | head -n 10
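If it's unclear what the first two filters hand over to sort, it can help to run them on their own. The sketch below wraps them in a function named tokenize (a hypothetical name, not part of the tutorial's script) and feeds it a made-up line of text:

```shell
# tokenize: hypothetical helper that reads text from standard input
# and prints one lowercased word per line
tokenize() {
  tr '[:upper:]' '[:lower:]' |
    grep -oE '[[:alpha:]]{1,}'
}

printf 'The King, the Fool!\n' | tokenize
# prints four lines: the, king, the, fool
```

Punctuation and spaces disappear because they never match [[:alpha:]], and grep's -o option prints each match on its own line.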

Creating a one-argument script

One thing at a time: let's create a script that accepts one argument: the slug/shortname for a play, e.g. lear for King Lear, romeo_juliet for Romeo and Juliet.

Using the nano text editor, open up a file named top-shake-words.sh. Re-type the code above, then alter it to read from the variable $1 instead of the hardcoded lear.txt:

cat "$1.txt" | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE '[[:alpha:]]{1,}' | \
  sort | uniq -c | \
  sort -rn | head -n 10

Quick tip: Notice that I've used double-quotes around $1.txt, that is, "$1.txt", and not '$1.txt'. Inside double-quotes, bash expands the variable reference (this is sometimes referred to as string interpolation). Inside single-quotes, bash will not expand it; instead, it will try (and fail) to open the file named, literally, $1.txt.
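The difference is easy to see with echo. In this small sketch, show_quotes is a hypothetical function used only for illustration; inside it, $1 is the function's first argument, just as $1 is the script's first argument:

```shell
show_quotes() {
  echo "$1.txt"   # double quotes: $1 is expanded
  echo '$1.txt'   # single quotes: the text is left exactly as typed
}

show_quotes lear
# prints:
#   lear.txt
#   $1.txt
```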

Executing bash top-shake-words.sh othello should result in the following:

    899 i
    793 and
    758 the
    625 to
    494 you
    472 of
    449 a
    427 my
    396 that
    359 iago

Accepting a second argument

The second argument is pretty straightforward to add. It has to modify the call to grep:

    grep -oE '[[:alpha:]]{1,}'

Modified to accept a second argument, e.g. $2

    grep -oE '[[:alpha:]]{$2,}'

However, this modification is not sufficient. Remember that we have to use double-quotes so that $2 is properly expanded by bash, i.e. the pattern must be written as "[[:alpha:]]{$2,}".
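The effect of the quotes on the grep pattern can be checked on a short made-up phrase. In this sketch, min_len_words is a hypothetical function whose first argument plays the role of the script's $2:

```shell
min_len_words() {
  # double quotes let bash substitute $1 before grep sees the pattern,
  # so calling this with 4 produces the pattern [[:alpha:]]{4,}
  grep -oE "[[:alpha:]]{$1,}"
}

echo 'a gentleman and a king' | min_len_words 4
# prints:
#   gentleman
#   king
```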

This time, executing bash top-shake-words.sh othello 7 should result in the following:

    331 othello
    229 desdemona
    104 roderigo
     43 lodovico
     43 brabantio
     35 montano
     33 general
     30 handkerchief
     29 lieutenant
     27 gratiano

Making the second argument optional

The new modification to the script changes its functionality: if the user calls it with one argument, e.g.

    bash top-shake-words.sh othello

– it will no longer work, because without the second argument, the script will run this invalid regular expression:

     grep -oE '[[:alpha:]]{,}'

The solution here is to use a conditional statement. Basically, if no second argument was passed, i.e. $2 is empty, then we want to use a value of 1; otherwise (i.e. $2 has a value), we use the value of the second argument.

This tutorial can't cover the details of conditional statements, which you can read more about at the TLDP Bash Guide for Beginners, so I'll provide the complete code as an example:

if [[ -z $2 ]]; then
  mval=1
else
  mval=$2
fi

cat "$1.txt" | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE "[[:alpha:]]{$mval,}" | \
  sort | uniq -c | \
  sort -rn | head -n 10
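For a simple default like this one, bash also offers a shorter spelling than if/else: the parameter expansion ${2:-1}, which expands to $2 when it is set and non-empty, and to 1 otherwise. This is an alternative to the conditional above, not what the tutorial's script uses. A minimal sketch, with min_len as a hypothetical function whose $1 stands in for the script's $2:

```shell
min_len() {
  # ${1:-1} means: use $1, or fall back to 1 if $1 is unset or empty
  mval="${1:-1}"
  echo "$mval"
}

min_len      # prints 1
min_len 7    # prints 7
```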

A third argument

Just for the fun of it, let's modify our script to take an optional third argument: a number which specifies the maximum length of the words to count.

Thus, to count every word that is at least 5 characters, but no more than 7 characters:

bash top-shake-words.sh othello 5 7

    331 othello
    252 cassio
    137 emilia
     98 shall
     79 would
     68 think
     67 there
     64 enter
     61 heaven
     54 night

This requires a third argument, and a modification of the regular expression to look for word boundaries. Remember that, on its own, [[:alpha:]]{5,7} would also match the first 7 letters of any longer word, e.g.

echo 'hellacious octogons' | grep -oE '[[:alpha:]]{5,7}'

hellaci
octogon

Review the guide on basic regular expressions to refresh your memory.
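Wrapping the pattern in \b word-boundary anchors rules out those partial matches. Note that \b is an extension supported by GNU grep rather than part of standard extended regular expressions, so this sketch assumes GNU grep; the function name bounded_words is hypothetical:

```shell
bounded_words() {
  # $1 = minimum length, $2 = maximum length
  # \b on both sides means the match must be a whole word
  grep -oE "\b[[:alpha:]]{$1,$2}\b"
}

echo 'hellacious octogons shall think' | bounded_words 5 7
# prints:
#   shall
#   think
```

The two long words no longer produce any output, because no 5-to-7 letter stretch inside them both starts and ends at a word boundary.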

To accommodate a third optional argument in the top-shake-words.sh script, we use another if/else conditional statement, this time, to modify the value passed into grep's extended-regular-expression option. Here's the complete script:

if [[ -z $2 ]]; then
  mval=1
else
  mval=$2
fi

if [[ -z $3 ]]; then
  regex="[[:alpha:]]{$mval,}"
else
  regex="\b[[:alpha:]]{$mval,$3}\b"
fi

cat "$1.txt" | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE "$regex" | \
  sort | uniq -c | \
  sort -rn | head -n 10

Auto-installer

If you remember the very first step of this process, we had to download a zip file of Shakespearean text. What happens if you email someone your top-shake-words.sh and they try to run it without having first downloaded the Shakespearean text files?

You could send them a note telling them how to download and unzip the data themselves. But as a convenience, let's design the script to automatically download the data for them.

This can easily be done by adding the commands we ran to download the data:

curl -so shakespeare.zip \
  http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip

Since unzip shakespeare.zip creates a new directory named shakespeare-plays-flat-text, we need to modify our script to read files from that subdirectory (previously, we changed into the subdirectory, but that's an unnecessary step). Here are the lines we add and change so that top-shake-words.sh downloads the data before acting on it:

curl -so shakespeare.zip \
  http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip

# ...the conditional statements

cat "./shakespeare-plays-flat-text/$1.txt" | \
# ...the other filters

Try the new script out by creating and changing into an entirely empty directory, copying the top-shake-words.sh script into it, and then running it:

mkdir -p /tmp/throwaway/foofun
cd /tmp/throwaway/foofun

bash top-shake-words.sh romeo_juliet 5 7

Not only will you get the most frequent 5-to-7 letter words, you'll also find yourself with a fresh new copy of the Bard's text as text files, inside the subdirectory ./shakespeare-plays-flat-text.

Considerate auto-installer

So now top-shake-words.sh will conveniently download shakespeare-plays-flat-text.zip and unzip it for the user. That's nice. But what happens if the user already ran the script once? Well, unfortunately, top-shake-words.sh, as we've modified it, will always re-download the data, even if it already exists. Try running it again to see what happens.

That's a bit annoying. And now you have a taste of the difficulties of building software that "just works" for any given user. Our problem is kind of an easy fix: We just use another conditional statement:

If the directory ./shakespeare-plays-flat-text exists, then do not attempt to re-download it. Else, download the file and unzip it to create the directory.

Think it over. Look up "how to test if directory exists using a shell script" yourself.

The fully modified script, with comments to remind you what part of the code is doing what, is below:

data_url='http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip'
data_subdir='./shakespeare-plays-flat-text'

# test to see if data has been downloaded
# if not, then download it and tell the user about it
if ! [[ -d "$data_subdir" ]]; then
  echo '-------------------'
  echo "First-time installation process..."
  echo "...Downloading from $data_url"
  curl -so shakespeare.zip "$data_url"
  unzip shakespeare.zip
  echo "Done installing data!"
  echo "-------------------"
fi

# test to see if the second argument, minimum number of word chars
# has been set. If not, it defaults to 1
if [[ -z $2 ]]; then
  mval=1
else
  mval=$2
fi

# test to see if the third argument, maximum number of word chars
# has been set. If not, there is no maximum word length
if [[ -z $3 ]]; then
  regex="[[:alpha:]]{$mval,}"
else
  regex="\b[[:alpha:]]{$mval,$3}\b"
fi

cat "$data_subdir/$1.txt" | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE "$regex" | \
  sort | uniq -c | \
  sort -rn | head -n 10

To test the script above, run it once. Then delete the shakespeare-plays-flat-text subdirectory. Then run the script again to see the installation process.
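The directory check can also be tried in isolation, without touching the network. The sketch below uses a throwaway directory created by mktemp -d and a hypothetical function named needs_download. It uses the single-bracket [ ! -d ... ] form, which behaves the same as the script's ! [[ -d ... ]] for this purpose:

```shell
needs_download() {
  # succeeds (exit status 0) when the directory named by $1 does NOT exist
  [ ! -d "$1" ]
}

tmpdir=$(mktemp -d)            # a directory that definitely exists

if needs_download "$tmpdir"; then
  echo "would download"
else
  echo "data already present, skipping download"
fi

rmdir "$tmpdir"                # delete it, as in the exercise above

if needs_download "$tmpdir"; then
  echo "would download"
else
  echo "data already present, skipping download"
fi
```

The first check reports that the data is already present; after the directory is removed, the second check reports that a download would happen.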

Wrapping up

The key takeaway here is that you've learned how to create a script that can wrap up any number and length of commands, so that re-running those commands is nothing more than a one-liner:

    bash my-script.sh

This ability to modularize your code will be profoundly helpful as you do more complicated tasks. Sometimes, you'll find yourself writing scripts that call other scripts, so that you don't have any one mega-script that is impossible to re-read and debug. In fact, you are already doing this: did you write the cat command? Or grep? No. Their functionality has been wrapped up in such a way that you just have to remember the names of their commands.

To paraphrase Peteris Krumins' paraphrase of Pablo Picasso (who might have stolen it from W.H. Davenport Adams):

Good coders code. Great coders reuse.

Remembering the Unix philosophy

This isn't a software engineering class, but you might have noticed that top-shake-words.sh seems to be unnecessarily limited. That is, it only operates on Shakespearean plays, but technically, it could do its word-counting magic on any source of text.

So why not modify it so that, if a URL is passed in, it will attempt to curl the webpage down (or, if it's just a local file, cat it) and then count the words? But then we'd have to write code to tell whether something is a URL or just a text file on the computer, among other issues that would come up.

In times like these, it's worth remembering the Unix philosophy, particularly the saying, "Do one thing and do it well." Our script does one thing fairly well: count occurrences of English words in a body of text. But it then tries to do something half-assedly, i.e. download Shakespearean text.

It would be much better if we changed the script to read from any text stream, like so:

curl -s http://romeo-and-juliet.com | bash top-any-words.sh 5 7

This underscores the part of the Unix philosophy that says: "Text is a universal interface". In the proposed modification above, top-any-words.sh would not even have to worry about the existence of data files, or the Internet. It lets curl, or whatever program is piping into it, handle the details of getting the text. That way the top-any-words.sh script just cares about one thing: counting (English) words.
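As a rough sketch of that idea only (top-any-words.sh is not written anywhere in this tutorial), the counting logic could live in a function that reads standard input and knows nothing about files or URLs. The name top_any_words and its argument order are assumptions made for illustration, and the \b anchors assume GNU grep:

```shell
# top_any_words: hypothetical stdin-reading version of the word counter
#   $1 = minimum word length (optional, defaults to 1)
#   $2 = maximum word length (optional, no maximum if omitted)
top_any_words() {
  if [ -z "$1" ]; then
    mval=1
  else
    mval=$1
  fi

  if [ -z "$2" ]; then
    regex="[[:alpha:]]{$mval,}"
  else
    regex="\b[[:alpha:]]{$mval,$2}\b"
  fi

  # no cat and no filename: the text arrives on standard input
  tr '[:upper:]' '[:lower:]' |
    grep -oE "$regex" |
    sort | uniq -c |
    sort -rn | head -n 10
}

printf 'Romeo, Romeo! Wherefore art thou Romeo?\n' | top_any_words 5 7
# prints a single line: a count of 3 for romeo
```

Because the function only reads a text stream, the same call works unchanged whether the text comes from printf, cat, or curl.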

We might deal with that design issue in another lesson. For now, it's enough to remember how following the Unix philosophy can make program design much easier and worry-free.

Other resources

The class textbook, "Data Science at the Command Line" has many more details on how to design versatile, reusable scripts in Chapter 4: Creating Reusable Command-Line Tools
How to Create a First Shell Script via the Linux Information Project
Shell scripts via Software Carpentry