This is yet another assignment using the U.S. Social Security Administration’s data. Using the nationwide set of baby names, find the year in which each combination of baby name and gender first reached at least 100 babies on the SSA’s list. For example, Pat,F first had at least 100 babies in 1923, while Pat,M made its 100-baby mark in 1914.
This is a continuation of More analysis of trends in American baby-naming, though considerably less complicated. It is meant more as a review of grep and of regular expressions, which will continue to be important for any kind of programming you do, including data-filtering and data-visualization tasks.
If you did the More analysis of trends in American baby-naming extra-credit, this folder will already exist, and your data-hold directory should already have the nationwide baby names. If you didn’t do that assignment, read the Hints section below to get the data bootstrapped.
|-compciv/
   |-homework/
      |--ssa-baby-name-fun/
         |--first-year.sh
         |--data-hold/
            |--names-nationwide/
               |--yob1880.txt
               |--(yob1881.txt etc. etc.)
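Assuming that the SSA nationwide baby name data exists in data-hold/names-nationwide, the first-year.sh script will find every combination of name and gender that has had at least 1,000 babies in a single year and, for each one, print the first year in which it had at least 100 babies.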
The output should be sorted alphabetically and look like this:
Aaden,M,2007
Aaliyah,F,1994
Aaron,M,1880
Abby,F,1954
Abel,M,1924
See the Hints section below for more examples of what your output should look like.
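As a minimal sketch of the lookup for a single name: grep -l lists the files in which a name has a count of three or more digits, and the earliest file name gives the year. The throwaway directory and the counts in it are hypothetical, made up for illustration, and are not real SSA figures.

```shell
# Build a throwaway directory with hypothetical counts (not real SSA figures)
tmp=$(mktemp -d)
printf 'Pat,F,40\nPat,M,120\n' > "$tmp/yob1914.txt"
printf 'Pat,F,150\nPat,M,300\n' > "$tmp/yob1923.txt"

# -l prints only the names of files containing a Pat,F line whose count
# starts with three digits; sorting puts the earliest year first
file=$(grep -lE '^Pat,F,[0-9]{3}' "$tmp"/yob*.txt | sort | head -n 1)

# Strip the directory, the .txt suffix, and the yob prefix to leave the year
year=$(basename "$file" .txt)
year=${year#yob}

echo "Pat,F,$year"
rm -r "$tmp"
```

The year is taken from the file name rather than the file contents, because the data lines themselves carry only name, gender, and count.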
Time to use regular expressions (more) like a pro. If you use a regex that looks like this:
'[0-9][0-9][0-9]'
you will lose points. Stop copy-paste-repeating yourself and use the proper repetition syntax.
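For illustration, here is a small sketch comparing the copy-pasted form with the {n} repetition quantifier of extended regular expressions. The sample lines are hypothetical and only mimic the name,gender,count layout of the SSA files.

```shell
# Hypothetical sample lines in the SSA "name,gender,count" layout
sample='Pat,F,99
Pat,F,100
Pat,M,1234'

# Verbose form: the character class is repeated by hand
verbose=$(printf '%s\n' "$sample" | grep -E ',[0-9][0-9][0-9]')

# Same match written with the {3} repetition quantifier
concise=$(printf '%s\n' "$sample" | grep -E ',[0-9]{3}')

echo "$concise"
```

Both patterns select the same lines. The quantified form is shorter and easier to change when the number of digits changes.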
If you already did the extra-credit, More analysis of trends in American baby-naming, then you'll already have the data as needed. If not, run the following from inside the ssa-baby-name-fun directory to download it:
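# Create the data directory and move into it
mkdir -p ./data-hold/names-nationwide
cd data-hold/names-nationwide
# Download the nationwide names archive
curl http://stash.compciv.org/ssa_baby_names/names.zip \
  -o names.zip
# Unpack (overwriting existing files), clean up, and return to the project folder
unzip -o names.zip
rm names.zip && cd ../..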
The code above will save the data in this structure:
|--ssa-baby-name-fun/
   |--first-year.sh
   |--data-hold/
      |--names-nationwide/
         |--yob1880.txt
         |--(yob1881.txt etc. etc.)
For a not-fancy-but-at-least-it-works solution, you should not need anything more than:
a for loop
Check the Unix tools page, as it contains all the tools and relevant options you'll need. And definitely brush up on regular expressions.
To satisfy the requirement of finding baby names that have had "at least 1,000 babies in a single year", you might be tempted to use math and if-statements. You could. Or you could think of it another way: What does the number 1000 have that 999, 100, 42, and 6 do not have?
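One way to read that question is to count digits instead of comparing values. The sketch below assumes the counts are plain integers with no leading zeros, and the list itself is hypothetical, borrowed from the numbers mentioned above.

```shell
# Hypothetical counts, one per line
counts='6
42
100
999
1000
25000'

# A number with four or more digits is necessarily 1000 or greater,
# so a digit-count pattern stands in for the numeric comparison
big=$(printf '%s\n' "$counts" | grep -E '^[0-9]{4,}$')

echo "$big"
```

The {4,} quantifier means "four or more", so no arithmetic or if-statement is involved.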
The output of first-year.sh should have 1,592 lines.
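A quick way to sanity-check your own output is to count its lines and confirm that it is sorted. This is only a sketch: the output variable below is a stand-in holding five of the expected lines, where in practice you would pipe in the output of first-year.sh, and the byte-wise LC_ALL=C ordering is an assumption about how "alphabetically" is meant.

```shell
# Stand-in for the script's output (five of the expected lines)
output='Aaden,M,2007
Aaliyah,F,1994
Aaron,M,1880
Abby,F,1954
Abel,M,1924'

# Count the lines; tr strips the padding some wc implementations add
line_count=$(printf '%s\n' "$output" | wc -l | tr -d ' ')

# sort -c exits non-zero if its input is not already in sorted order
if printf '%s\n' "$output" | LC_ALL=C sort -c; then
  sorted=yes
else
  sorted=no
fi

echo "$line_count lines, sorted: $sorted"
```

For the full data set, the same count should come out to 1592.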
The first 25 lines of output
Aaden,M,2007
Aaliyah,F,1994
Aaron,M,1880
Abby,F,1954
Abel,M,1924
Abigail,F,1949
Abraham,M,1893
Ada,F,1880
Adalyn,F,2005
Adalynn,F,2007
Adam,M,1880
Adan,M,1969
Addison,F,1991
Addyson,F,2001
Adelaide,F,1887
Adele,F,1888
Adeline,F,1884
Adelyn,F,2003
Aden,M,1999
Adrian,M,1912
Adriana,F,1959
Adrianna,F,1975
Adrienne,F,1917
Agnes,F,1880
Aidan,M,1990
The last 25 lines of output
Willow,F,1996
Wilma,F,1896
Wilson,M,1909
Winifred,F,1883
Woodrow,M,1911
Wyatt,M,1955
Xander,M,1999
Xavier,M,1953
Ximena,F,2000
Yahir,M,2002
Yaretzi,F,2005
Yasmin,F,1974
Yesenia,F,1971
Yolanda,F,1913
Yvette,F,1917
Yvonne,F,1903
Zachary,M,1949
Zachery,M,1976
Zackary,M,1979
Zander,M,1999
Zane,M,1926
Zayden,M,2004
Zion,M,1998
Zoe,F,1952
Zoey,F,1992
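# Collect every name,gender pair with a four-digit (1,000+) count in any year
names=$(cat data-hold/names-nationwide/*.txt | grep -E '[0-9]{4}' | cut -d ',' -f 1,2 | sort | uniq)
for name in $names; do
  # List the files where this pair has a count of three or more digits (100+).
  # The ^ anchor keeps a pair such as Aden,M from also matching Braden,M.
  year=$(grep -lE "^$name,[0-9]{3}" data-hold/names-nationwide/*.txt | \
    sort | grep -oE '[0-9]{4}' | head -n 1)
  # The earliest matching file name supplies the year
  echo "$name,$year"
done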