Since we're already familiar with the U.S. Social Security Administration's data on the most popular baby names, this assignment is more practice with Unix text utilities and shell scripting: quickly finding answers to trivia questions about how Americans name their babies.
In your compciv repo, create a subdirectory (from the command-line) named: homework/ssa-baby-name-fun
The helper.sh script, when executed, should download and unpack both zip files of SSA baby names: nationwide and by state.
The state zip should be downloaded into the subfolder data-hold/names-by-state
The nationwide zip should be downloaded into the subfolder data-hold/names-nationwide
Given two years, x and y, the lost-names.sh script should return an alphabetized list of names, by gender, that appear in the nationwide data for year x but not in year y.
Sample usage:
bash lost-names.sh 1880 2013
Sample output:
Addie,M
Adda,F
Adline,F
Adolf,M
Albertina,F
Arvid,M
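Both scripts ultimately compute a set difference with grep: the -F flag treats patterns as fixed strings, -v inverts the match, and -f reads the patterns from a file. A minimal sketch with made-up files (the filenames and rows are illustrative, not the real SSA data):

```shell
# Two hypothetical name,gender lists
printf 'Adda,F\nAddie,M\nMary,F\n' > names-1880.txt
printf 'Mary,F\nLiam,M\n' > names-2013.txt

# Lines of names-1880.txt that do NOT appear anywhere in names-2013.txt
grep -Fvf names-2013.txt names-1880.txt
```

Here `Mary,F` is filtered out because it appears in both files, leaving `Adda,F` and `Addie,M`.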
Given a state abbreviation and a year, the uniquely-stately.sh script should return a list of names, by gender, that appear only in the given state for that year. The output should include the gender, the name, and the number of babies given that name in that state and year.
Sample usage:
bash uniquely-stately.sh IA 2013
Sample output:
M,Kinnick,16
M,Tayten,6
F,Kinnick,5
M,Kysen,5
M,Seeley,5
The helper.sh script is similar to the one you wrote in the homework that set up your GitHub homework folder. However, mind the different folder names (and note that you're downloading both the nationwide and per-state data).
You should be able to write both scripts using just a combination of grep, cut, sort, and uniq.
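To see how those tools compose, here is an illustrative pipeline over a made-up file in the nationwide name,gender,count format (the filename and rows are hypothetical):

```shell
# Made-up rows in the nationwide format: name,gender,count
printf 'Mary,F,7065\nJohn,M,9655\nMary,F,13\n' > sample.txt

# Keep only the name,gender fields, sort them so that uniq
# can collapse adjacent duplicates into one line each
cut -d ',' -f 1,2 sample.txt | sort | uniq
```

This reduces the file to the distinct name,gender pairs, the same preprocessing step both scripts need before comparing years or states.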
You could use a for loop, but that's unnecessary with proper use of pipes to filter the data.
And you'll want to know how to write scripts that can read arguments from the command-line.
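In a bash script, the command-line arguments arrive as the positional parameters $1, $2, and so on. A minimal sketch (the variable names are illustrative; `set --` just simulates invoking a script with two arguments):

```shell
# Simulate "bash lost-names.sh 1880 2013" by setting the positional parameters
set -- 1880 2013

year_x=$1   # first argument:  1880
year_y=$2   # second argument: 2013
echo "comparing $year_x against $year_y"
```

Inside an actual script file, you would skip the `set --` line and simply read $1 and $2 as passed by the caller.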
# helper.sh: create the data folders, then fetch and unpack both SSA zip files
mkdir -p ./data-hold/{names-by-state,names-nationwide}
# Download each zip into its folder, unpack it, and discard the zip
cd data-hold/names-by-state && curl -O http://stash.compciv.org/ssa_baby_names/namesbystate.zip && unzip -o namesbystate.zip && rm namesbystate.zip && cd ../..
cd data-hold/names-nationwide && curl -O http://stash.compciv.org/ssa_baby_names/names.zip && unzip -o names.zip && rm names.zip && cd ../..
# lost-names.sh: names present in year $1 but absent in year $2
cd data-hold/names-nationwide
# Reduce each year's file to sorted name,gender pairs
sort "yob$1.txt" | cut -d ',' -f 1,2 > y1.txt
sort "yob$2.txt" | cut -d ',' -f 1,2 > y2.txt
# -F: fixed strings; -v: invert the match; -f: read patterns from a file
grep -Fvf y2.txt y1.txt
rm y1.txt y2.txt
cd ../..
state=$1
year=$2
# grabbing just the filenames of all the states except $state
fnames=$(ls ./data-hold/names-by-state/*.TXT | grep -v "$state")
# Now compiling all the names in all the OTHER states and getting just the
# unique combinations of name and gender, that were found in $year
cat $fnames | grep ",$year," | cut -d ',' -f 2,4 | sort | uniq > data-hold/tmp.txt
# get all the names in this $state
# find only the rows that belong to that year
# then do a grep -v -f of the combined names found in the previous step
grep ",$year," "./data-hold/names-by-state/$state.TXT" | \
  cut -d ',' -f 2,4,5 | \
  grep -Fv -f data-hold/tmp.txt
# Alternate version of lost-names.sh that uses full paths instead of cd
sort "data-hold/names-nationwide/yob$1.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$1.txt"
sort "data-hold/names-nationwide/yob$2.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$2.txt"
grep -Fvf "data-hold/lost-$2.txt" "data-hold/lost-$1.txt"