Homework

Firsts in American baby-naming

Even more practice with text filters, this time to find when baby names first became known.

Due: Tuesday, February 10

Points: 3 (Extra Credit)

This is yet another assignment using the U.S. Social Security Administration’s data. Using the nationwide set of baby names, find the first year in which a combination of baby name and gender reached at least 100 babies on the SSA’s list. For example, Pat,F first had at least 100 babies in 1923, while Pat,M made its 100-baby mark in 1914.

This is a continuation of More analysis of trends in American baby-naming, though considerably less complicated. It is meant mainly as a review of grep and of regular expressions, which will continue to be important for any kind of programming you do, including data-filtering and data-visualization tasks.

Deliverables

A folder named `homework/ssa-baby-name-fun`

If you did the More analysis of trends in American baby-naming extra-credit, this folder will already exist, and your data-hold should already have the nationwide baby names. If you didn’t do that assignment, read the Hints section below to get the data bootstrapped.

  |-compciv/
    |-homework/
       |--ssa-baby-name-fun/
          |--first-year.sh
          |--data-hold/
             |--names-nationwide/
                |--yob1880.txt
                |--(yob1881.txt etc. etc.)

The `first-year.sh` script

With the assumption that the SSA nationwide baby name data exists in data-hold/names-nationwide, the first-year.sh script will:

Find every combination of name and gender that has had at least 1,000 babies in a single year.
For each combination above, find the first year in which that combo had at least 100 babies.

The output should be sorted alphabetically and look like this:

              Aaden,M,2007
              Aaliyah,F,1994
              Aaron,M,1880
              Abby,F,1954
              Abel,M,1924

See the Sample output part of the Hints section below for examples of what your output should look like.
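For a single name and gender, the second step can be sketched as follows. This is only an illustration under stated assumptions: the three tiny files and their counts are invented (only the yobYYYY.txt naming mirrors the real data), and the variable names are hypothetical. It relies on grep -l printing just the names of files that contain a match, and on the shell expanding *.txt in sorted order, so the earliest year comes first.

```shell
# Build a tiny, made-up data set in a temporary directory.
# The counts are invented for illustration; they are NOT real SSA figures.
demo_dir=$(mktemp -d)
printf 'Pat,F,42\nPat,M,80\n'   > "$demo_dir/yob1913.txt"
printf 'Pat,F,57\nPat,M,103\n'  > "$demo_dir/yob1914.txt"
printf 'Pat,F,101\nPat,M,250\n' > "$demo_dir/yob1923.txt"

# Run inside the directory so the only digits in each filename are the year.
# grep -l lists matching files; head keeps the earliest; grep -o extracts the year.
first_year_pat_f=$(cd "$demo_dir" &&
    grep -lE '^Pat,F,[0-9]{3}' *.txt | head -n 1 | grep -oE '[0-9]{4}')
first_year_pat_m=$(cd "$demo_dir" &&
    grep -lE '^Pat,M,[0-9]{3}' *.txt | head -n 1 | grep -oE '[0-9]{4}')

echo "Pat,F,$first_year_pat_f"
echo "Pat,M,$first_year_pat_m"

# Clean up the temporary files.
rm -r "$demo_dir"
```

With the invented counts above, this prints Pat,F,1923 and Pat,M,1914. The pattern is not anchored at the end of the line, so a count of four or more digits also matches, which is what "at least 100" requires.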

No repetition in your regexes

Time to use regular expressions (more) like a pro. If you use a regex that looks like this:

    '[0-9][0-9][0-9]'

– you will lose points. Stop copy-paste-repeating yourself and use the proper repetition syntax.
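As a small illustration of the repetition syntax, here is a sketch on made-up sample lines (the names follow the SSA layout of name,gender,count, but the counts are invented). In extended regex mode (grep -E), a quantifier in curly braces repeats the item before it, so [0-9]{3} stands for three digits in a row.

```shell
# Made-up sample lines in the "name,gender,count" layout; counts are invented.
sample='Aaron,M,102
Abby,F,99
Abel,M,1500'

# Instead of spelling out [0-9] three times, use the {3} quantifier.
# -E turns on extended regular expressions, where {n} needs no backslashes.
three_plus=$(echo "$sample" | grep -E ',[0-9]{3}')

echo "$three_plus"
```

This prints the Aaron and Abel lines. Because the pattern is not anchored at the end of the line, a four-digit count such as 1500 also matches; {3,} would say "three or more" explicitly.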

Hints

Bootstrapping the data

If you already did the extra-credit assignment, More analysis of trends in American baby-naming, then you'll already have the data as needed. Otherwise, run the following from inside your ssa-baby-name-fun folder:

mkdir -p ./data-hold/names-nationwide 
cd data-hold/names-nationwide

curl http://stash.compciv.org/ssa_baby_names/names.zip \
  -o names.zip

unzip -o names.zip
rm names.zip && cd ../..

The code above will save the data in this structure:

   |--ssa-baby-name-fun/
      |--first-year.sh
      |--data-hold/
         |--names-nationwide/
            |--yob1880.txt
            |--(yob1881.txt etc. etc.)

Your toolset

For a not-fancy-but-at-least-it-works solution, you should not need anything more than:

a single for loop
a variable, if necessary
cut
cat
echo
grep
head
sort
uniq

Check the Unix tools page, as it contains all the tools and relevant options you'll need. And definitely brush up on regular expressions.
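To show how a few of these tools chain together, here is a minimal sketch that reduces some sample lines to their unique name-and-gender combinations. The sample lines and their counts are made up for illustration, and the variable names are hypothetical.

```shell
# Made-up lines in the "name,gender,count" layout, as if several years
# of data had been concatenated with cat. Counts are invented.
sample='Mary,F,7065
Anna,F,2604
Mary,F,6919
John,M,9655
Anna,F,2698'

# cut keeps fields 1 and 2 (name and gender), using the comma as delimiter.
# sort puts identical combos next to each other so uniq can collapse them.
combos=$(echo "$sample" | cut -d ',' -f 1,2 | sort | uniq)

echo "$combos"
```

This prints Anna,F, John,M, and Mary,F, one per line. Note that uniq only removes adjacent duplicates, which is why sort comes first.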

Thinking in numerical patterns

To satisfy the requirement of finding baby names that have had "at least 1,000 babies in a single year", you might be tempted to use math and if-statements. You could. Or you could think of it another way: What does the number 1000 have that 999, 100, 42, and 6 do not have?
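One possible reading of this hint, sketched on a handful of bare numbers: 1000 is the smallest number written with four digits, so "at least 1,000" can be tested by counting digits rather than comparing values. The list of numbers and the variable names below are assumptions for illustration only.

```shell
# A few numbers, one per line, including the ones mentioned in the hint.
counts='6
42
100
999
1000
23456'

# Keep only lines made up entirely of four or more digits.
# ^ and $ anchor the pattern to the whole line; {4,} means "four or more".
at_least_1000=$(echo "$counts" | grep -E '^[0-9]{4,}$')

echo "$at_least_1000"
```

This prints 1000 and 23456. The same idea would apply to the count field of the SSA files, though the anchoring may need to change there since each line also holds a name and a gender.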

Sample output

The output of first-year.sh should have 1,592 lines.

The first 25 lines of output

Aaden,M,2007
Aaliyah,F,1994
Aaron,M,1880
Abby,F,1954
Abel,M,1924
Abigail,F,1949
Abraham,M,1893
Ada,F,1880
Adalyn,F,2005
Adalynn,F,2007
Adam,M,1880
Adan,M,1969
Addison,F,1991
Addyson,F,2001
Adelaide,F,1887
Adele,F,1888
Adeline,F,1884
Adelyn,F,2003
Aden,M,1999
Adrian,M,1912
Adriana,F,1959
Adrianna,F,1975
Adrienne,F,1917
Agnes,F,1880
Aidan,M,1990

The last 25 lines of output

Willow,F,1996
Wilma,F,1896
Wilson,M,1909
Winifred,F,1883
Woodrow,M,1911
Wyatt,M,1955
Xander,M,1999
Xavier,M,1953
Ximena,F,2000
Yahir,M,2002
Yaretzi,F,2005
Yasmin,F,1974
Yesenia,F,1971
Yolanda,F,1913
Yvette,F,1917
Yvonne,F,1903
Zachary,M,1949
Zachery,M,1976
Zackary,M,1979
Zander,M,1999
Zane,M,1926
Zayden,M,2004
Zion,M,1998
Zoe,F,1952
Zoey,F,1992

Solution

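# Every name,gender combo with a four-digit count (1,000 or more) in at least one year
names=$(cat data-hold/names-nationwide/*.txt | grep -E '[0-9]{4}' | cut -d ',' -f 1,2 | sort | uniq)

for name in $names; do
    # List the files (one per year) in which this combo has a count of three or
    # more digits, then take the year from the earliest filename.
    # The ^ anchor keeps the combo from matching in the middle of a line.
    year=$(grep -lE "^$name,[0-9]{3}" data-hold/names-nationwide/*.txt | \
    sort | grep -oE '[0-9]{4}' | head -n 1)

    echo "$name,$year"
done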