This is an extension to a previous tutorial, "Creating Basic Shell Scripts".
Let's move on to a more complicated toy example: we'll write a shell script named top-shake-words.sh that has this usage:
Given a single argument, a reference to a Shakespearean play such as lear, the top-shake-words.sh script returns the top 10 words by frequency of occurrence (regardless of capitalization) in that play:
# given a reference to a play, like 'lear'
bash top-shake-words.sh lear
910 the
737 and
576 to
476 of
463 you
458 my
363 that
303 in
282 not
282 king
Then design top-shake-words.sh so that it can take a second argument: list only the words that have a certain minimum number of letters. For example, to find the top 10 words in King Lear, by frequency, that have at least 7 letters:
bash top-shake-words.sh lear 7
Result:
174 gloucester
82 goneril
74 cornwall
62 cordelia
56 gentleman
34 nothing
30 daughter
28 daughters
25 brother
23 against
So execute these commands to download and unzip a folder of Shakespearean plays (rendered as plaintext files):
curl -so shakespeare.zip \
http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip
This should create a new subdirectory named shakespeare-plays-flat-text. Change into that directory and count up all the lines in all the text files:
cd shakespeare-plays-flat-text
cat *.txt | wc -l
(You should end up with 240,241 lines)
Given a stream of text, how do we break it up into individual words and then count how often each word occurs?
Try to think through the steps and look up the individual commands on your own. The answer is below, using lear.txt and 1 as hard-coded values for the variables (the play's file and the minimum word length):
cat lear.txt | \
tr '[:upper:]' '[:lower:]' | \
grep -oE '[[:alpha:]]{1,}' | \
sort | uniq -c
To get just the top 10 results, sorted in reverse order, we add just two more filters:
cat lear.txt | \
tr '[:upper:]' '[:lower:]' | \
grep -oE '[[:alpha:]]{1,}' | \
sort | uniq -c | \
sort -rn | head -n 10
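If the pipeline feels like a lot at once, it can help to run it one stage at a time on a tiny input. Here is a minimal sketch that does so; the sample sentence and the variable names (sample, lowered, words, ranked) are made up for this illustration and are not part of the tutorial's script:

```shell
# Sample text standing in for a play (made up for this illustration)
sample='The King and the Fool. The KING!'

# Stage 1: lowercase everything so "The" and "the" count as the same word
lowered=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]')
echo "$lowered"

# Stage 2: print each run of letters on its own line
words=$(printf '%s\n' "$lowered" | grep -oE '[[:alpha:]]{1,}')
echo "$words"

# Stage 3: sort so identical words sit next to each other, count them,
# then rank the counts from highest to lowest
ranked=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn)
echo "$ranked"
```

The first sort matters because uniq -c only collapses identical lines that are adjacent. In this sample, "the" should come out on top with a count of 3.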
One thing at a time: let's create a script that accepts one argument: the slug/shortname for a play, e.g. lear for King Lear, romeo_juliet for Romeo and Juliet.
Using the nano text editor, open up a file named top-shake-words.sh. Re-type the code above, then alter it to build the filename from the variable $1 instead of using the hardcoded lear.txt:
cat "$1.txt" | \
tr '[:upper:]' '[:lower:]' | \
grep -oE '[[:alpha:]]{1,}' | \
sort | uniq -c | \
sort -rn | head -n 10
Quick tip: Notice that I've used double-quotes around $1.txt, that is, "$1.txt", and not '$1.txt'. Inside single-quotes, bash does not expand a variable reference (the expansion that double-quotes allow is sometimes referred to as string interpolation). With single-quotes, bash would instead try (and fail) to open the file named, literally, $1.txt.
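To see the quoting difference for yourself, here is a small sketch. It uses set -- to simulate a script argument, and the variable names single and double are just for illustration:

```shell
# Simulate calling a script with one argument, e.g. bash some-script.sh lear
set -- lear

single='$1.txt'   # single quotes: the characters $1 are kept literally
double="$1.txt"   # double quotes: $1 is replaced by its value

echo "$single"    # prints: $1.txt
echo "$double"    # prints: lear.txt
```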
Executing bash top-shake-words.sh othello should result in the following:
899 i
793 and
758 the
625 to
494 you
472 of
449 a
427 my
396 that
359 iago
The second argument is pretty straightforward to add. It has to modify the call to grep:
grep -oE '[[:alpha:]]{1,}'
Modified to use the second argument, i.e. $2:
grep -oE '[[:alpha:]]{$2,}'
However, this modification is not sufficient. Again, remember we have to use double-quotes so that $2 is properly expanded by bash before grep sees the pattern.
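A minimal sketch of the double-quoted version, tried on a made-up sample line rather than a real play (the regex and matches variables are only for illustration):

```shell
# Simulate: bash top-shake-words.sh othello 7
set -- othello 7

# Double quotes let bash substitute $2 before grep ever sees the pattern
regex="[[:alpha:]]{$2,}"
echo "$regex"     # prints: [[:alpha:]]{7,}

# Made-up sample line; only words of 7 or more letters survive
matches=$(printf 'honest iago and desdemona\n' | grep -oE "$regex")
echo "$matches"   # prints: desdemona
```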
This time, executing the script with a minimum word length as the second argument, e.g. bash top-shake-words.sh othello 7, should result in the following:
331 othello
229 desdemona
104 roderigo
43 lodovico
43 brabantio
35 montano
33 general
30 handkerchief
29 lieutenant
27 gratiano
The new modification to the script changes its functionality: if the user calls it with one argument, e.g.
bash top-shake-words.sh othello
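it will no longer work as intended, because without the second argument, $2 expands to nothing and the script runs grep with an interval that has no minimum, which is not valid in standard (POSIX) extended regular expressions: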
grep -oE '[[:alpha:]]{,}'
The solution here is to use a conditional statement. Basically, if no second argument was passed, i.e. $2 is empty, then we want to use a value of 1; else (i.e. $2 has a value), we use the value of the second argument.
This tutorial can't cover the details of conditional statements, which you can read more about at the TLDP Bash Guide for Beginners, so I'll provide the complete code as an example:
if [[ -z $2 ]]; then
mval=1
else
mval=$2
fi
cat "$1.txt" | \
tr '[:upper:]' '[:lower:]' | \
grep -oE "[[:alpha:]]{$mval,}" | \
sort | uniq -c | \
sort -rn | head -n 10
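As an aside, the shell also has a shorthand for "use this value unless it's empty": the ${parameter:-default} expansion. This is an alternative to the if/else above, not a replacement for learning conditionals. A minimal sketch, again using set -- to simulate script arguments (mval_with_arg is a made-up name for this illustration):

```shell
# Simulate: bash top-shake-words.sh othello   (no second argument)
set -- othello
mval="${2:-1}"            # $2 is unset, so the default of 1 is used
echo "one argument:  mval=$mval"

# Simulate: bash top-shake-words.sh othello 7
set -- othello 7
mval_with_arg="${2:-1}"   # $2 is set, so its value wins
echo "two arguments: mval=$mval_with_arg"
```

The if/else form is longer, but it is easier to extend when you need to do more than pick a default.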
Just for the fun of it, let's modify our script to take an optional third argument: a number which specifies the maximum length of the words to count.
Thus, to count every word that is at least 5 characters, but no more than 7 characters:
bash top-shake-words.sh othello 5 7
331 othello
252 cassio
137 emilia
98 shall
79 would
68 think
67 there
64 enter
61 heaven
54 night
This requires a third argument, and a modification of the regular expression to look for word boundaries. Remember that [[:alpha:]]{5,7} on its own would also match the first 7 letters of longer words, e.g.
echo 'hellacious octogons' | grep -oE '[[:alpha:]]{5,7}'
hellaci
octogon
Review the guide on basic regular expressions to refresh your memory.
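Here is a small sketch comparing the pattern with and without word boundaries. It assumes a grep that supports \b in extended regular expressions (GNU grep does); the two extra words in the second sample are made up for this illustration:

```shell
# Without boundaries: long words get chopped to their first 7 letters
unbounded=$(echo 'hellacious octogons' | grep -oE '[[:alpha:]]{5,7}')
echo "$unbounded"

# With \b on both sides: only whole words of 5 to 7 letters match,
# so the two long words are skipped entirely
bounded=$(echo 'hellacious octogons shall think' | grep -oE '\b[[:alpha:]]{5,7}\b')
echo "$bounded"
```

The second command should print only shall and think, each on its own line.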
To accommodate a third optional argument in the top-shake-words.sh script, we use another if/else conditional statement, this time to modify the value passed into grep's extended-regular-expression option. Here's the complete script:
if [[ -z $2 ]]; then
mval=1
else
mval=$2
fi
if [[ -z $3 ]]; then
regex="[[:alpha:]]{$mval,}"
else
regex="\b[[:alpha:]]{$mval,$3}\b"
fi
cat "$1.txt" | \
tr '[:upper:]' '[:lower:]' | \
grep -oE "$regex" | \
sort | uniq -c | \
sort -rn | head -n 10
If you remember the very first step of this process, we had to download a zip file of Shakespearean text. What happens if you email someone your top-shake-words.sh and they try to run it without having first downloaded the Shakespearean text files?
You could send them a note telling them how to download and unzip the data themselves. But as a convenience, let's design the script to automatically download the data for them.
This can easily be done by adding the commands we ran to download the data:
curl -so shakespeare.zip \
http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip
Since unzip shakespeare.zip creates a new directory named shakespeare-plays-flat-text, we need to modify our script to read files from that subdirectory (previously, we changed into the subdirectory, but that's an unnecessary step). Here are the lines we add and change so that top-shake-words.sh downloads the data before acting on it:
curl -so shakespeare.zip \
http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip
unzip shakespeare.zip
# ...the conditional statements
cat "./shakespeare-plays-flat-text/$1.txt" | \
# ...the other filters
Try the new script out by creating and changing into an entirely empty directory, copying the top-shake-words.sh script into it, and then running it:
mkdir -p /tmp/throwaway/foofun
cd /tmp/throwaway/foofun
bash top-shake-words.sh romeo_juliet 5 7
Not only will you get the most frequent 5-to-7 letter words, you'll find yourself with a fresh new copy of the Bard's text as text files, inside the subdirectory ./shakespeare-plays-flat-text.
So now top-shake-words.sh will conveniently download shakespeare-plays-flat-text.zip and unzip it for the user. That's nice. But what happens if the user already ran the script once? Well, unfortunately, top-shake-words.sh, as we've modified it, will always re-download the data, even if it already exists. Try running it again to see what happens.
That's a bit annoying. And now you have a taste of the difficulties of building software that "just works" for any given user. Our problem is kind of an easy fix: We just use another conditional statement:
If the directory ./shakespeare-plays-flat-text exists, then do not attempt to re-download it. Else, download the file and unzip it to create the directory.
Think it over. Look up "how to test if directory exists using a shell script" yourself.
The fully modified script, with comments to remind you what part of the code is doing what, is below:
data_url='http://stash.compciv.org/scrapespeare/shakespeare-plays-flat-text.zip'
data_subdir='./shakespeare-plays-flat-text'
# test to see if data has been downloaded
# if not, then download it and tell the user about it
if ! [[ -d "$data_subdir" ]]; then
echo '-------------------'
echo "First-time installation process..."
echo "...Downloading from $data_url"
curl -so shakespeare.zip "$data_url"
unzip shakespeare.zip
echo "Done installing data!"
echo "-------------------"
fi
# test to see if the second argument, minimum number of word chars
# has been set. If not, it defaults to 1
if [[ -z $2 ]]; then
mval=1
else
mval=$2
fi
# test to see if the third argument, maximum number of word chars
# has been set. If not, there is no maximum word length
if [[ -z $3 ]]; then
regex="[[:alpha:]]{$mval,}"
else
regex="\b[[:alpha:]]{$mval,$3}\b"
fi
cat "$data_subdir/$1.txt" | \
tr '[:upper:]' '[:lower:]' | \
grep -oE "$regex" | \
sort | uniq -c | \
sort -rn | head -n 10
To test the script above, run it once. Then delete the shakespeare-plays-flat-text subdirectory. Then run the script again to see the installation process.
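If you'd rather watch the directory test in isolation, without touching the network, here is a minimal sketch. It works inside a throwaway directory made by mktemp, uses mkdir as a stand-in for the download-and-unzip step, and uses the single-bracket [ -d ... ] form, which behaves the same as [[ -d ... ]] for this check. The needs_install helper and the first_run/second_run variables are hypothetical names, not part of the script above:

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
data_subdir="$workdir/shakespeare-plays-flat-text"

# Hypothetical helper: succeeds when the data directory is missing
needs_install() {
  ! [ -d "$data_subdir" ]
}

# First run: the directory does not exist yet
if needs_install; then first_run='download'; else first_run='skip'; fi

# Stand-in for the curl + unzip step
mkdir "$data_subdir"

# Second run: the directory is there, so the download is skipped
if needs_install; then second_run='download'; else second_run='skip'; fi

echo "first run: $first_run, second run: $second_run"

# Clean up the throwaway directory
rm -r "$workdir"
```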
The key takeaway here is that you've learned how to create a script that can wrap up any number and length of commands, so that re-running those commands is nothing more than a one-liner:
bash my-script.sh
This ability to modularize your code will be profoundly helpful as you do more complicated tasks. Sometimes, you'll find yourself writing scripts that call other scripts, so that you don't have any one mega-script that is impossible to re-read and debug. In fact, you are already doing this: did you write the cat command? Or grep? No. Their functionality has been wrapped up in such a way that you just have to remember the names of their commands.
To paraphrase Peteris Krumins' paraphrase of Pablo Picasso (who might have stolen it from W.H. Davenport Adams):
Good coders code. Great coders reuse.
This isn't a software engineering class, but you might have noticed that top-shake-words.sh seems to be unnecessarily limited. That is, it only operates on Shakespearean plays, but technically, it could do its word-counting magic on any source of text.
So why not modify it so that if a URL is passed in, it will attempt to curl the webpage down, or if it's just a local file, just cat it, and then count the words? But then we'd have to write code to tell if something is a URL versus just a text file on the computer…among other issues that would come up.
In times like these, it's worth remembering the Unix philosophy, particularly the saying, "Do one thing and do it well." Our script does one thing fairly well: count occurrences of English words in a body of text. But it then tries to do something half-assedly, i.e. download Shakespearean text.
It would be much better if we changed the script to read from any text stream, like so:
curl http://romeo-and-juliet.com | bash top-any-words.sh 5 7
This underscores the part of the Unix philosophy that says: "Text is a universal interface". In the proposed modification above, top-any-words.sh would not even have to worry about the existence of data files, or the Internet. It lets curl, or whatever program is piping into it, handle the details of getting the text. That way the top-any-words.sh script just cares about one thing: counting (English) words.
We might deal with that design issue in another lesson. For now, it's enough to remember how following the Unix philosophy can make program design much easier and worry-free.
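For the curious, here is one rough sketch of what that stream-reading design might look like. It is written as a shell function named top_any_words (a hypothetical name, as is the sample sentence) so it can be tried without creating a new file. It reuses the same filters and the same \b word-boundary pattern as the script above, so it carries the same assumption of a grep that supports \b:

```shell
# Hypothetical function: reads text from standard input and prints the
# top 10 words. Optional arguments: minimum and maximum word length.
top_any_words() {
  min="${1:-1}"   # default minimum length of 1 when no argument is given
  max="${2:-}"    # empty means no maximum
  if [ -z "$max" ]; then
    regex="[[:alpha:]]{$min,}"
  else
    regex="\b[[:alpha:]]{$min,$max}\b"
  fi
  # No cat and no filename: tr simply reads whatever is piped in
  tr '[:upper:]' '[:lower:]' |
    grep -oE "$regex" |
    sort | uniq -c |
    sort -rn | head -n 10
}

# Usage: pipe any text in; here, a made-up sample line
printf 'Romeo, Romeo! Wherefore art thou Romeo?\n' | top_any_words 5 7
```

With the sample line, the only word of 5 to 7 letters is romeo, so it should be reported with a count of 3. Notice that the function never mentions a file or a URL: getting the text is somebody else's job.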