More analysis of trends in American baby-naming

More practice with text filters to find interesting trends in the SSA baby name data.

Due: Tuesday, January 27
Points: 5 (Extra Credit)

Because we’re already fairly familiar with the U.S. Social Security Administration’s data on the most popular baby names, this assignment is more practice with Unix text utilities and shell scripting to quickly answer simple questions about how Americans name their babies.

Deliverables

  • A project folder named homework/ssa-baby-name-fun

    In your compciv repo, create a subdirectory (from the command-line) named: homework/ssa-baby-name-fun

  • helper.sh to download the SSA baby data

    The helper.sh script, when executed, should download and unpack both zip files of SSA baby names: the nationwide data and the state-by-state data.

  • lost-names.sh - find names that went out of style

    Given two years, x and y, the lost-names.sh script should print an alphabetized list of name and gender pairs that appear in the nationwide data for year x but not for year y.

    Sample usage:

          bash lost-names.sh 1880 2013
    

    Sample output:

          Adda,F
          Addie,M
          Adline,F
          Adolf,M
          Albertina,F
          Arvid,M
    
  • uniquely-stately.sh - find names unique to a state

    Given a state abbreviation and a year, the uniquely-stately.sh script should return a list of names, by gender, that appear only in the given state for that year. Each line of output should include the gender, the name, and the number of babies given that name in that state and year.

    Sample usage:

        bash uniquely-stately.sh IA 2013
    

    Sample output:

        M,Kinnick,16
        M,Tayten,6
        F,Kinnick,5
        M,Kysen,5
        M,Seeley,5
    
  • Hints

    Solution

    helper.sh

    # Create one directory for each data set
    mkdir -p ./data-hold/{names-by-state,names-nationwide} 
    
    # Download and unpack the state-by-state files, then delete the zip
    cd data-hold/names-by-state && curl -O http://stash.compciv.org/ssa_baby_names/namesbystate.zip && unzip -o namesbystate.zip && rm namesbystate.zip && cd ../..
    
    # Do the same for the nationwide files
    cd data-hold/names-nationwide && curl -O http://stash.compciv.org/ssa_baby_names/names.zip && unzip -o names.zip && rm names.zip && cd ../..
    

    uniquely-stately.sh

    state=$1
    year=$2
    
    # grabbing just the filenames of all the states except $state
    fnames=$(ls ./data-hold/names-by-state/*.TXT | grep -v "/$state\.TXT")
    
    # Now compiling all the names in all the OTHER states and getting just the
    # unique combinations of gender and name that were found in $year.
    # The trailing comma added by sed keeps a pattern such as F,Mary from
    # also matching a longer name such as F,Maryann
    cat $fnames | grep ",$year," | cut -d ',' -f 2,4 | sort | uniq | sed 's/$/,/' > data-hold/tmp.txt
     
    # get all the names in this $state
    # find only the rows that belong to that year
    # then do a grep -v -f of the combined names found in the previous step
    cat "./data-hold/names-by-state/$state.TXT" | \
        grep ",$year," | \
        cut -d ',' -f 2,4,5 | \
        grep -vF -f data-hold/tmp.txt
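
    To see the filtering idea on a small scale, here is a minimal sketch that runs the same kind of pipeline on a few made-up rows. It assumes the state files use the layout state,sex,year,name,count. The rows, the counts, the second state (NE), and the variable names workdir and unique_names are all invented for illustration. The sketch appends a trailing comma to each pattern so that F,Mary does not also knock out F,Maryann; that is this example's own choice.

```shell
# Scratch directory so the real data-hold folder is not touched
workdir=$(mktemp -d)

# Made-up rows in the assumed layout: state,sex,year,name,count
printf '%s\n' 'IA,F,2013,Mary,40' 'IA,F,2013,Maryann,5' 'IA,M,2013,Kinnick,16' > "$workdir/IA.TXT"
printf '%s\n' 'NE,F,2013,Mary,35' 'NE,M,2013,Liam,50' > "$workdir/NE.TXT"

# sex,name pairs seen in the OTHER state for 2013, each with a trailing comma
grep ',2013,' "$workdir/NE.TXT" | cut -d ',' -f 2,4 | sort -u | sed 's/$/,/' > "$workdir/others.txt"

# Iowa rows for 2013 whose sex,name pair never appears in the other state
unique_names=$(grep ',2013,' "$workdir/IA.TXT" | cut -d ',' -f 2,4,5 | grep -vF -f "$workdir/others.txt")
echo "$unique_names"

rm -r "$workdir"
```

    With these made-up rows, F,Mary,40 is dropped because Mary also appears in the other state, while F,Maryann,5 and M,Kinnick,16 are kept.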
    

    lost-names.sh

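    # Reduce each year's file to sorted name,gender pairs
    sort "data-hold/names-nationwide/yob$1.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$1.txt"
    sort "data-hold/names-nationwide/yob$2.txt" | cut -d ',' -f 1,2 > "data-hold/lost-$2.txt"
    # Print the pairs from year $1 that have no exact match in year $2
    grep -Fvxf "data-hold/lost-$2.txt" "data-hold/lost-$1.txt"
    rm "data-hold/lost-$1.txt" "data-hold/lost-$2.txt"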
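
    The "in year x but not in year y" question is a set difference, and comm is another standard tool for it. The following is only a sketch of that alternative, not the approach used in the solution above. It assumes the nationwide files use the layout name,sex,count. The rows, the counts, and the names workdir and lost are made up for illustration.

```shell
# Scratch directory so the real data-hold folder is not touched
workdir=$(mktemp -d)

# Made-up rows in the assumed layout: name,sex,count
printf '%s\n' 'Arvid,M,10' 'Anna,F,30' 'Adda,F,20' > "$workdir/yob1880.txt"
printf '%s\n' 'Liam,M,40' 'Anna,F,50' > "$workdir/yob2013.txt"

# Keep only name,sex and sort, because comm expects sorted input
cut -d ',' -f 1,2 "$workdir/yob1880.txt" | sort > "$workdir/x.txt"
cut -d ',' -f 1,2 "$workdir/yob2013.txt" | sort > "$workdir/y.txt"

# -2 hides lines only in y, -3 hides lines in both: what is left is only in x
lost=$(comm -23 "$workdir/x.txt" "$workdir/y.txt")
echo "$lost"

rm -r "$workdir"
```

    With these made-up rows, Anna,F appears in both years and is dropped, leaving Adda,F and Arvid,M. Because comm compares whole lines, a short name can never accidentally match part of a longer one.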