Since we're already familiar with the U.S. Social Security Administration's data on the most popular baby names, this assignment is more practice with Unix text utilities and shell scripting: quickly finding answers to trivia questions about how Americans name their babies.
In your compciv repo, create a subdirectory (from the command-line) named: homework/ssa-baby-name-fun
The helper.sh script, when executed, should download and unpack both zip files of SSA baby names: nationwide and by state.
The state zip should be downloaded into the subfolder data-hold/names-by-state
The nationwide zip should be downloaded into the subfolder data-hold/names-nationwide
Given two years, x and y, the lost-names.sh script should return an alphabetized list of names, by gender, that appear in the nationwide data for year x but not in year y.
Sample usage:
bash lost-names.sh 1880 2013
Sample output:
Addie,M
Adda,F
Adline,F
Adolf,M
Albertina,F
Arvid,M
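Both scripts ultimately compute a set difference with grep: the -F flag treats patterns as fixed strings, -v inverts the match, and -f reads the patterns from a file. A minimal sketch with made-up files (the filenames and rows are illustrative, not the real SSA data):

```shell
# Two hypothetical name,gender lists
printf 'Adda,F\nAddie,M\nMary,F\n' > names-1880.txt
printf 'Mary,F\nLiam,M\n' > names-2013.txt

# Lines of names-1880.txt that do NOT appear anywhere in names-2013.txt
grep -Fvf names-2013.txt names-1880.txt
```

Here `Mary,F` is filtered out because it appears in both files, leaving `Adda,F` and `Addie,M`.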
Given a state abbreviation and a year, the uniquely-stately.sh script should return a list of names, by gender, that appear only in the given state for that year. The output should include the gender, the name, and the number of babies given that name in that state and year.
Sample usage:
bash uniquely-stately.sh IA 2013
Sample output:
M,Kinnick,16
M,Tayten,6
F,Kinnick,5
M,Kysen,5
M,Seeley,5
The helper.sh script is similar to the one you wrote in the homework that set up your GitHub homework folder. However, mind the different folder names (and note that you're downloading both the nationwide and per-state data).
You should be able to write both scripts using just a combination of grep, cut, sort, and uniq.
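To see how those tools compose, here is an illustrative pipeline over a made-up file in the nationwide name,gender,count format (the filename and rows are hypothetical):

```shell
# Made-up rows in the nationwide format: name,gender,count
printf 'Mary,F,7065\nJohn,M,9655\nMary,F,13\n' > sample.txt

# Keep only the name,gender fields, sort them so that uniq
# can collapse adjacent duplicates into one line each
cut -d ',' -f 1,2 sample.txt | sort | uniq
```

This reduces the file to the distinct name,gender pairs, the same preprocessing step both scripts need before comparing years or states.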
You could use a for loop, but that's unnecessary with proper use of pipes to filter the data.
And you'll want to know how to write scripts that can read arguments from the command-line.
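In a bash script, the command-line arguments arrive as the positional parameters $1, $2, and so on. A minimal sketch (the variable names are illustrative; `set --` just simulates invoking a script with two arguments):

```shell
# Simulate "bash lost-names.sh 1880 2013" by setting the positional parameters
set -- 1880 2013

year_x=$1   # first argument:  1880
year_y=$2   # second argument: 2013
echo "comparing $year_x against $year_y"
```

Inside an actual script file, you would skip the `set --` line and simply read $1 and $2 as passed by the caller.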
# helper.sh: create the data folders, then fetch and unpack both SSA zip files
mkdir -p ./data-hold/{names-by-state,names-nationwide}
# Download each zip into its folder, unpack it, and discard the zip
cd data-hold/names-by-state && curl -O http://stash.compciv.org/ssa_baby_names/namesbystate.zip && unzip -o namesbystate.zip && rm namesbystate.zip && cd ../..
cd data-hold/names-nationwide && curl -O http://stash.compciv.org/ssa_baby_names/names.zip && unzip -o names.zip && rm names.zip && cd ../..
# lost-names.sh: names present in year $1 but absent in year $2
cd data-hold/names-nationwide
# Reduce each year's file to sorted name,gender pairs
sort "yob$1.txt" | cut -d ',' -f 1,2 > y1.txt
sort "yob$2.txt" | cut -d ',' -f 1,2 > y2.txt
# -F: fixed strings; -v: invert the match; -f: read patterns from a file
grep -Fvf y2.txt y1.txt
rm y1.txt y2.txt
cd ../..
state=$1
year=$2
# grabbing just the filenames of all the states except $state
fnames=$(ls ./data-hold/names-by-state/*.TXT | grep -v "$state")
# Now compiling all the names in all the OTHER states and getting just the
# unique combinations of name and gender, that were found in $year
cat $fnames | grep ",$year," | cut -d ',' -f 2,4 | sort | uniq > data-hold/tmp.txt
# get all the names in this $state
# find only the rows that belong to that year
# then do a grep -v -f of the combined names found in the previous step
grep ",$year," "./data-hold/names-by-state/$state.TXT" | \
  cut -d ',' -f 2,4,5 | \
  grep -Fv -f data-hold/tmp.txt
# Alternate version of lost-names.sh that uses full paths instead of cd
sort "data-hold/names-nationwide/yob$1.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$1.txt"
sort "data-hold/names-nationwide/yob$2.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$2.txt"
grep -Fvf "data-hold/lost-$2.txt" "data-hold/lost-$1.txt"