- basename: Extract just the filename from a filepath
- bc: A calculator that reads from standard input
- cat: Concatenate files together
- cd: Change directory
- cp: Copy files
- csvfix: Parse CSV files
- curl: Transfer a URL
- cut: Cut out selected portions of lines
- date: Print or parse date strings
- echo: Print arguments to standard output
- grep: Print lines matching a pattern
- head: Print only the first few lines of a text stream
- history: Show the last executed commands
- hostname: Print the name of the computer you're currently on
- iconv: Convert between character sets
- jq: A command-line JSON parser
- kill: Send a signal to a running process
- less: Paginate long text streams
- ls: List directory contents
- man: Show documentation for a command
- mkdir: Make a directory
- mv: Move or rename files
- nano: Interactive text editor
- printf: Format and print data
- ps: Show a snapshot of current processes
- pup: Parse HTML from the command line
- pwd: Print the name of your working directory
- read: Read a line from standard input
- rm: Remove files
- sed: Stream editor for complex transformations of text
- seq: Print a sequence of numbers
- sleep: Suspend execution for a period of time
- sort: Sort lines of text
- tail: Print only the last lines of a text stream
- touch: Create an empty file or update its timestamp
- tr: Translate characters in a text stream
- uniq: Print only unique lines of text
- unzip: Extract files from a zip archive
- wc: Print the line, word, and byte counts of a text stream
- wget: Easy web crawling
- whoami: Print your username
- zip: Add files to a compressed archive
Standard usage
basename ./hello/there/cat.jpg
cat.jpg
Get a filename and remove its suffix with -s
basename -s '.jpg' ./hello/there/cat.jpg
cat
It works on URLs too
url="http://www.compciv.org/files/images/topics/scraping/http-cats.jpg"
fname=$(basename "$url")
curl "$url" > "$fname"
Standard usage
echo '100 / 3' | bc
33
Use the -l, --mathlib option to get floating point results
echo '100 / 3' | bc -l
33.33333333333333333333
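Control the number of decimal places by setting bc's scale variable (a quick sketch)
echo 'scale=2; 100 / 3' | bc
33.33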
Adding two (or more) files together
cat file1.txt file2.txt
line from file 1
line from file 1
line from file 2
Unnecessary (but fine, if it helps you to read the pipeline from left to right) use of cat
cat onefile.txt | grep 'hi'
Add a Heredoc-style string into a file
Heredocs are helpful for working with complex multi-line strings, such as raw HTML.
cat > basic.html<<'EOF'
<html>
<head>
<title>My first "Web Page"</title>
</head>
<body>
<h1>A headline</h1>
<p>Check out the
<a href="http://www.nytimes.com">New York Times</a>
</p>
</body>
</html>
EOF
Change into a directory
cd some/path
Change into home directory
cd ~
Change to parent directory
cd ..
Change into the system’s root
cd /
Change into the system’s /tmp
directory
cd /tmp
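Change back to the previous directory you were in
cd -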
Standard usage
cp source_file.txt new_file.txt
Force copy: overwrite files without prompting
cp -f source_file.txt existing_file.txt
Make a copy of a directory with the -r
option
cp -r some_dir/ new_dir/
Copy something into your home directory
cp something.txt ~
Copy all files with a .txt
extension into a sub-directory
cp *.txt some_dir
This utility provides the ability to parse text files in which the values/columns are delimited by commas, or a delimiter of your choice. Because of the possibility that CSV files contain multi-line data (and, oh, the lack of a standard that will foil even the most skilled greppers), it is recommended that you use CSVFix when dealing with delimited-text data.
The list of subcommands is long; if you need to do something specific, check the CSVFix docs and you’ll probably find what you need.
To install on corn.stanford.edu, after having set your PATH to include ~/bin_compciv:
wget https://bitbucket.org/neilb/csvfix/get/version-1.6.zip
unzip version-1.6.zip && rm version-1.6.zip
cd neilb-csvfix-e804a794d175
make lin
cp ./csvfix/bin/csvfix ~/bin_compciv/
For the purpose of some of the examples, example.csv
contains the following:
Name,Quantity,Cost
Apple,35,2.00
Orange,67,1.95
Durian,9,12.00
Use the echo
subcommand to print the CSV in a standard format to stdout
csvfix echo example.csv
"Name","Quantity","Cost"
"Apple","35","2.00"
"Orange","67","1.95"
"Durian","9","12.00"
Use the -osep option to change the delimiter of CSV data when printing to stdout
csvfix echo -osep '|' example.csv
"Name"|"Quantity"|"Cost"
"Apple"|"35"|"2.00"
"Orange"|"67"|"1.95"
"Durian"|"9"|"12.00"
Select and rearrange order of the columns with the order
subcommand
csvfix order -n 3,2,1 example.csv
"Cost","Quantity","Name"
"2.00","35","Apple"
"1.95","67","Orange"
"12.00","9","Durian"
Select, rearrange order by column name with order -fn
csvfix order -fn Cost,Name example.csv
"Cost","Name"
"2.00","Apple"
"1.95","Orange"
"12.00","Durian"
Sort the data by a column with the sort subcommand, using the -rh option to include the header
csvfix sort -rh -f 1 example.csv
Name,Quantity,Cost
"Apple","35","2.00"
"Durian","9","12.00"
"Orange","67","1.95"
Force csvfix to only double-quote fields when necessary with -smq
option
csvfix -smq order -f 3,2,1 example.csv
Cost,Quantity,Name
2.00,35,Apple
1.95,67,Orange
12.00,9,Durian
Force csvfix to use a specific delimiter with -osep
for the output
csvfix -osep '@' order -f 3,2,1 example.csv
"Cost"@"Quantity"@"Name"
"2.00"@"35"@"Apple"
"1.95"@"67"@"Orange"
"12.00"@"9"@"Durian"
Sort the 3rd column, in descending numerical order
csvfix sort -rh -f 3:DN example.csv
Name,Quantity,Cost
"Durian","9","12.00"
"Apple","35","2.00"
"Orange","67","1.95"
Use printf to customize the output of the field values
csvfix printf -fmt "There are %s %s %f" example.csv
There are Name Quantity 0.000000
There are Apple 35 2.000000
There are Orange 67 1.950000
There are Durian 9 12.000000
Switch up the order of columns for printf
with -f
option
csvfix printf -f 2,1,3 -fmt "There are %s %ss and they cost %f each" example.csv
There are Quantity Names and they cost 0.000000 each
There are 35 Apples and they cost 2.000000 each
There are 67 Oranges and they cost 1.950000 each
There are 9 Durians and they cost 12.000000 each
Use the -ifn option to remove the header from the output
csvfix printf -ifn -f 2,1,3 -fmt "There are %s %ss and they cost %f each" example.csv
There are 35 Apples and they cost 2.000000 each
There are 67 Oranges and they cost 1.950000 each
There are 9 Durians and they cost 12.000000 each
This nearly-ubiquitous tool makes it possible to interact with Web sites and APIs. Check out its manual for its many options.
Download and print to standard output
curl http://www.example.com
Download and save to specified file name with -o, --output
curl http://www.example.com -o somefile.txt
Suppress status indicator and error messages with -s, --silent
curl http://www.example.com -s
Automatically follow redirects with -L
curl http://t.co/d -L
Fetch only the headers with --head, -I
curl http://t.co/d -I
Download and save to the basename of a URL with -O
This handy option will create a filename using the basename of a URL, i.e. the last segment of the URL path
curl http://www.example.com/stuff.zip -O
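curl can also send data to websites and APIs; for example, the -d, --data option submits a POST request (the URL here is purely illustrative)
curl -d 'name=Dan' http://www.example.com/form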
Specify a delimiter with -d
and which fields to show with -f
echo A,B,C,D,E | cut -d ',' -f 3,4
C,D
Cut out everything except the nth character with -c [n]
echo 'Hello world' | cut -c 7
w
Cut out everything except the range x to y with -c [x-y]
echo 'Hello world' | cut -c 3-7
ello w
Cut out everything before the nth character -c [n-]
echo 'Hello world' | cut -c 7-
world
Cut out everything after the nth character -c [-n]
echo 'Hello world' | cut -c -7
Hello w
Standalone usage, display the date now
date
Sat Jan 24 10:52:28 PST 2015
Display the date by parsing a given string with -d, --date=
date -d '2013-01-03'
Thu Jan 3 00:00:00 PST 2013
Parse a relatively human-friendly date string
date -d 'Feb 9 1913'
Sun Feb 9 00:00:00 PST 1913
Format the current date as YYYY-MM-DD
date +%Y-%m-%d
2015-02-06
Format the output as YYYY-MM-DD
date -d 'May 15, 1974' +%Y-%m-%d
1974-05-15
Format the output as YYYY-MM-DD HH:MM:SS
date -d 'May 15, 1974 9:32 PM' '+%Y-%m-%d %H:%M:%S'
1974-05-15 21:32:00
Use -I, --iso-8601
as a shortcut for standard ISO YYYY-MM-DD format
date -d 'Sept 25, 2014 3:52:11 PM' -I
2014-09-25
Specify precision with -I[precision]
date -d 'Sept 25, 2014 3:52:11 PM' -Iseconds
2014-09-25T15:52:11-0700
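Print the number of seconds since the Unix epoch with %s
date +%s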
Print “something” to screen
echo something
something
Print a variable’s value to stdout
something='fun times'
echo $something
fun times
Print “something” into a pipe
echo something | tr '[:lower:]' '[:upper:]'
SOMETHING
Quickie concatenation of strings
a=apples
b=bongos
echo "$a AND $b"
apples AND bongos
This 40-year-old tool is one of the most famous and ubiquitous Unix programs, and perhaps the most commonly used tool for searching text.
Printing matching lines in a file
grep 'hello' file1.txt
hello world
say hello
Grepping multiple files, showing file names with the match
grep 'hello' file1.txt file2.txt
file1.txt:hello world
file1.txt:say hello
file2.txt:just a hello
Reading from standard input means no filenames are shown alongside the matches
cat file1.txt file2.txt | grep 'hello'
hello world
say hello
just a hello
Case insensitive search with -i
grep -i 'HELLO' file1.txt
hello world
say hello
Showing non-matching lines with -v
grep -v 'hello' file1.txt
bye world
say bye
Using extended regular expressions with -E
grep -E '[0-9]{5}' file1.txt
Beverly Hills 90210
Printing only the match, not the entire line with -o
echo 'Hello world' | grep -o 'world'
world
Printing just the match made by a regular expression pattern (5 or more alphanumerical characters)
cat file1.txt | grep -oE '[[:alnum:]]{5,}'
hello
world
hello
world
Beverly
Hills
90210
Show the x lines before a match with -B x
grep -B 1 'Beverly' file1.txt
say bye
Beverly Hills 90210
Show the y lines after a match with -A y
grep -A 1 'hello world' file1.txt
hello world
say hello
Grep for a series of strings that are contained in a file with -f
grep -f things.txt file1.txt
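For example, if things.txt were a hypothetical file containing the two patterns hello and bye (one per line), the command above would print:
hello world
say hello
bye world
say bye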
Grep faster when you don’t need regular expressions with -F
grep -F 'word' file.txt
When grepping a list of files (not stdin), use -l
to list all files that match the given term at least once.
grep -l 'word' *.txt
When grepping a list of files (not stdin), use -L
to list all files that don’t contain the given term
grep -L 'word' *.txt
Print only the first x lines with -n [x]
cat *.txt | head -n 5
Read from a file instead of standard input
head -n 5 file1.txt
Print all lines except the last 5, with -n [-x]
head -n -5 file1.txt
Standard usage
history
Show past commands that involved `cat’
history | grep cat
Show just the 5 most recent commands
history | tail -n 5
Remove leading line numbers (as long as history is under 99,999 commands)
history | cut -c 8-
Standard usage
hostname
corn30.stanford.edu
For our purposes, iconv can be used to bypass the issues that arise from dealing with textual data in unexpected character encodings, such as emojis.
For more information, read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Attempt a translation of non-ASCII characters to ASCII
This is useful for converting accented characters, such as é
and ô
to their non-accented equivalents.
echo Béyôncæ | iconv -t ASCII//TRANSLIT
B'ey^oncae
Ignore all non-ASCII (i.e. standard American-English) characters
This command will sometimes give you an error message. If so, refer to the iconv usage shown below.
cat somefile.txt | iconv -t ASCII//IGNORE
Force the conversion of UTF-8 characters to ASCII
cat somefile.txt | iconv -c -f utf-8 -t ascii
This is a tool not part of standard Linux distributions but is extremely handy for working with JSON data.
jq has its own parsing language and methods, both for extracting data and for outputting new data structures.
The jq manual is the most comprehensive reference for how jq works, but you can refer to this basic tutorial for the basic concepts.
Simply parse and pretty-print
echo '{"name": "Dan"}' | jq '.'
{
"name": "Dan"
}
Select an object’s attribute
echo '{"name": "Dan"}' | jq '.name'
"Dan"
Select multiple attributes
echo '{"name": "Dan", "age": 45}' | jq '.name, .age'
"Dan"
45
Print raw-output with -r, --raw-output
echo '{"name": "Dan", "age": 45}' | jq -r '.name, .age'
Dan
45
Select an element from an array
echo '["a", "b", "c"]' | jq '.[1]'
"b"
Select attributes from an array of objects
echo '[{"name": "Dan", "age": 42}, {"name": "Bob", "age": 55}]' |
jq '.[] | .name'
"Dan"
"Bob"
Terminate a process with a given PID of 1234 (use ps aux
to find PID)
kill 1234
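Forcefully terminate a process that ignores the normal signal by sending SIGKILL with -9
kill -9 1234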
Terminate all processes that you are allowed to terminate
kill -9 -1
Show a text stream one page at a time
cat *.txt | less
Default listing of files
ls
List all files, including hidden files with -a, --all
ls -a
Show a long list with file attributes with -l
ls -l
Basic usage
man cat
Make a single directory
mkdir my_sub_dir
Make multiple directories
mkdir apples oranges pears
Make a directory and all its parent directories with -p
mkdir -p a/path/to/a/new/subdir
Make a subdirectory inside your home directory
mkdir ~/new_dir
Make a subdirectory inside /tmp
mkdir /tmp/new_dir
Rename a file
mv old_name.txt new_name.txt
Rename/move a file even if new name exists with -f
mv -f old_name.txt new_name.txt
Ask before overwriting an existing file with -i
mv -i old_name.txt new_name.txt
Move something into your home directory
mv somefile ~
Move all files with a .txt
extension into a sub-directory
mv *.txt some_dir
Open (or create) a file and enter interactive-editing mode
nano file.txt
The printf command is like echo, just much more powerful and versatile. The Bash Hackers Wiki has a nice page on it.
With printf, you pass in at least two arguments:
- A string containing a sort of template for text, with special syntax for placeholders.
- A string (or several strings) that are then inserted into the placeholders of the first argument.
There is a bewildering array of syntax placeholders. The examples will try to cover the basics.
Basic usage
By default, printf
will not print a newline character at the end, causing the output to butt up against the prompt.
Like this: My name is Danuser@host:~$
printf 'My name is %s' 'Dan'
Print a new line at the end with ‘\n’
The \n stands for 'new line'
printf 'My name is %s \n' 'Dan'
Work with multiple arguments
printf 'My name is %s %s. \nI am %s.\n' 'Dan' 'Man' 'happy'
My name is Dan Man.
I am happy.
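printf also understands numeric placeholders, such as %d for integers and %f for floating-point numbers (a small sketch; the values here are made up)
printf '%s is %d years old and %.1f feet tall\n' 'Dan' 45 5.9
Dan is 45 years old and 5.9 feet tall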
Printing out an HTML string
printf '
<h1>Hello %s</h1>
<p>
<a href="%s">%s</a>
</p> \n' 'Stranger' 'http://www.thestranger.com/' 'A news site'
<h1>Hello Stranger</h1>
<p>
<a href="http://www.thestranger.com/">A news site</a>
</p>
Using a Heredoc-style string in a variable
See the example for the read command for more information on heredocs
Note: if you want to preserve the newlines in some_html, you have to double-quote it, i.e. printf "$some_html"
read -r -d '' some_html <<'EOF'
<h1>Hello %s</h1>
<p>Here is a kitten:</p>
<img src="http://placekitten.com/g/%s/%s">
\n
EOF
printf "$some_html" 'Cat Lover' 500 300
<h1>Hello Cat Lover</h1>
<p>Here is a kitten:</p>
<img src="http://placekitten.com/g/500/300">
List all processes belonging to the current user and session
ps
PID TTY TIME CMD
1434 pts/65 00:00:00 sleep
1532 pts/65 00:00:00 ps
25247 pts/65 00:00:00 bash
List all processes running on the system
ps aux
List all of your processes by filtering for your username (this is what you most frequently want to do)
ps aux | grep $(whoami)
The pup tool is inspired by the jq JSON-parsing tool, but is used for parsing HTML with CSS selectors.
Select elements with a CSS selector (here, every a tag)
curl www.example.com | pup 'a'
<a href="http://www.iana.org/domains/example">
More information...
</a>
Extract the value of an attribute with attr{}
curl www.example.com | pup 'a attr{href}'
http://www.iana.org/domains/example
Extract just the text of the matched elements with text{}
curl www.example.com | pup 'a text{}'
More information...
(When inside your own home directory)
pwd
/afs/.ir/users/y/o/your_home
The read command is often used to handle reading text streams line-by-line, which is not something that some_var=$(cat some.txt) will do by default.
It’s especially helpful in combination with a while
loop and for assigning Heredocs, i.e. multi-line strings that are too complex to delimit with quotation marks, to variables.
For the most part, we want to use the -r
option, which prevents backslashes from doing their normal thing of escaping characters.
Useful links:
- man page for read
- GNU reference for heredocs
- StackOverflow: How to assign a heredoc value to a variable in Bash?
- TLDP: Here Documents
For the examples below, assume example.txt
contains:
README.txt
42
Documents and Settings
index.html
Dogs and Cats.html
Read each line from a file and pass it into a while
loop
while read -r x; do
echo "Opening...$x"
done < example.txt
Opening...README.txt
Opening...42
Opening...Documents and Settings
Opening...index.html
Opening...Dogs and Cats.html
Read each line from a command and pipe into a while
loop
curl -s http://www.example.com | while read -r some_line; do
echo "This is a line: $some_line"
done
This is a line: <!doctype html>
This is a line: <html>
This is a line: <head>
This is a line: <title>Example Domain</title>
Read each line from a command and pass it into a while loop, right to left
To read the output of a command, wrap it between <( and ) (as opposed to $( and )), and feed it to the loop with a redirect (<)
while read -r x; do
echo "Opening...$x"
done < <(cat example.txt | grep 'html')
Opening...index.html
Opening...Dogs and Cats.html
Save a multi-line Heredoc into a variable and do not interpret special Bash symbols
This will be the most common pattern we follow when creating HTML templates within Bash.
Heredocs make it easy to describe a multi-line string without worrying about whether you’ve used the right number of quote marks.
This particular example is derived from this excellent StackOverflow Q&A.
This example, with the use of 'EOF', prevents things like $ from being interpreted by Bash.
The use of the option -d '' tells read to keep on reading even after the first newline.
Basically, see read -r -d '' as the boilerplate to memorize.
read -r -d '' some_variable <<'EOF'
<html>
<head>
<title>My first "Web Page"</title>
</head>
<body>
<h1>A headline</h1>
<p>Check out the
<a href="http://www.nytimes.com">New York Times</a>
</p>
</body>
</html>
EOF
Remove a file
rm somefile.txt
Remove all the files in the current directory
rm *
Remove all the files in the current directory but ask for confirmation with -i
rm -i *
Remove a file and do not ask for confirmation or show errors with -f
rm -f somefile.txt
Remove a file even if it is an empty directory with -d
rm -d somedir
If the given filename is a directory, remove it and everything inside of it with -r
rm -r somedir
Wipe out your computer (i.e. making a typo while doing rm -rf
is very bad)
rm -rf /
sed is a very powerful program that basically has its own language, and thus has books and websites devoted to it.
For our purposes, we can focus solely on its substitution command (Bruce Barnett describes it as "The essential command"), which allows us to transform text with far more power than we can with just tr.
Basic substitution using the s
subcommand
echo 'hello world' | sed s/hello/bye/
bye world
Repeat the substitution for every match with the g
flag
echo 'hello world bye world' | sed s/world/people/g
hello people bye people
Make matches based on extended regular expressions with -E
option
echo 'Beverly Hills 90210' | sed -E 's/[0-9]{3}/q/'
Beverly Hills q10
An example of regex capturing groups and backreferences
"echo 'Beverly Hills 90210' | sed -E 's/([0-9]+)/I love \1 a lot/'"
Beverly Hills I love 90210 a lot
Print numbers 1 to 5
seq 1 5
1
2
3
4
5
Sleep for 10 seconds
sleep 10
Sleep for 5 days (the d suffix works only with GNU sleep, not on OS X)
sleep 5d
Sort in ascending alphabetical order
sort lines.txt
100
9
A
a
b
Sort in reverse order with -r
sort -r lines.txt
b
a
A
9
100
Sort numbers based on numerical value with -n
sort -n lines.txt
A
a
b
9
100
Sort lines based on a column q
with -k [q]
and a delimiter f
with -t [f]
sort -k 3 -t ',' lines.csv
C,D,Y
A,B,Z
Print only the last x lines with -n [x]
cat *.txt | tail -n 5
Read from a file instead of standard input
tail -n 5 file1.txt
Skip the first line in a file with -n [+2]
tail -n +2 file1.txt
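Follow a file as it grows (e.g. a log file) with -f; press Ctrl-C to stop
tail -f somefile.txt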
Update file’s accessed/modified time, or create it if it doesn’t exist
touch somefile.txt
Replace one character for another
echo Hello world | tr 'o' 'a'
Hella warld
Replace multiple characters
echo Hello world | tr 'lo' 'xo'
Hexxa warxd
Normalize all whitespace characters (including newlines) to spaces
txt="Hello,
world"
echo "$txt" | tr '[:space:]' ' '
Hello, world
Delete a character, such as a space character, with -d
echo Hello world | tr -d ' '
Helloworld
Translate lower-case characters to upper-case using character classes
echo Hello world | tr '[:lower:]' '[:upper:]'
HELLO WORLD
Remove all punctuation
echo 'Hello, world!' | tr -d '[:punct:]'
Hello world
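Squeeze runs of a repeated character down to a single character with -s
echo 'Hello    world' | tr -s ' '
Hello world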
Print unique lines; note that uniq only removes adjacent duplicates, so unsorted input may still contain repeats
uniq somefile.txt
oranges
apples
oranges
kiwis
apples
Used in conjunction with sort
sort somefile.txt | uniq
apples
kiwis
oranges
Print unique values and frequency of occurrence with -c
option
sort somefile.txt | uniq -c
2 apples
1 kiwis
3 oranges
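To rank values by how often they occur, pipe the counts into a reverse numerical sort
sort somefile.txt | uniq -c | sort -rn
3 oranges
2 apples
1 kiwis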
Basic unzipping
unzip some.zip
Use -o
option to overwrite existing files without prompting user
unzip -o some.zip
Extract files and pipe their contents to stdout with -p
option
unzip -p some.zip
Extract only specific files and pipe their contents into a new file
unzip -p stuff.zip 14.txt 42.txt > file.txt
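List an archive's contents without extracting anything with -l
unzip -l some.zip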
Print line, word, and character count
wc somefile.txt
6 8 55 somefile.txt
Print just the line count with -l
wc -l somefile.txt
6 somefile.txt
Print just the word count with -w
wc -w somefile.txt
8 somefile.txt
Print just the character count with -c
wc -c somefile.txt
55 somefile.txt
Count the lines from standard input to avoid showing filename
cat somefile.txt | wc -l
6
Like curl, wget can be used to download individual files from the Web. However, it contains a suite of features geared towards batch downloads, i.e. web crawling; wget was recently described as a "Low-Cost Tool to Best the N.S.A."
And similar to curl, wget has a mountain of documentation worth reading.
Here are some examples from the official docs. I also like The Geek Stuff’s list of wget examples
Download a single file and save to a default filename
Unlike curl
, wget does not send downloaded content to stdout by default. Instead, it derives a base filename to save to the current working directory.
For example, wget en.wikipedia.org/wiki/Hello
will save to a file named Hello
. If the target is a directory (i.e. with a trailing slash, e.g. wget en.wikipedia.org/wiki/
), it will save to index.html
Also by default: if the default filename already exists, wget will create a new, numbered variation, e.g. index.html.1
wget www.example.com
--2015-06-18 05:09:00-- http://www.example.com/
Resolving www.example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: ‘index.html’
100%[======================================>] 1,270 --.-K/s in 0.002s
2015-06-18 05:09:00 (743 KB/s) - ‘index.html’ saved [1270/1270]
Redirect to stdout
wget -O - www.example.com
[content of the webpage]
100%[======================================>] 1,270 --.-K/s in 0.001s
2015-06-13 04:52:45 (1.53 MB/s) - written to stdout [1270/1270]
Download files only if newer than existing files
With this option, wget will set the downloaded file’s timestamp based on the web server’s Last-Modified
header.
On subsequent downloads using -N
, wget will fetch a file only if it is newer than the existing file.
Read the full docs at gnu.org: Time-Stamping Usage
wget -N www.example.com
--2015-06-18 05:06:39-- http://www.example.com/
Resolving www.example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Server file no newer than local file ‘index.html’ -- not retrieving.
Recursively download links
This is where wget starts to get fun, and dangerous. The recursive option will cause wget to download not just the target page, but all URLs linked to from that page. This includes URLs of things like images and stylesheets.
By default, it will save all of the files into a directory named after the site domain.
It should go without saying that this can be a massive operation if you aren’t careful.
From the documentation on Recursive Download:
Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on
wget -r www.stanford.edu
[a wall of output showing that every file linked to from the Stanford homepage has been downloaded]
2015-06-18 05:17:43 (4.26 MB/s) - ‘www.stanford.edu/about/history/images/hero-seq.jpg’ saved [520408/520408]
FINISHED --2015-06-18 05:17:43--
Total wall clock time: 12s
Downloaded: 147 files, 9.2M in 4.6s (2.00 MB/s)
Specify the number of layers (i.e. the depth) for a recursive crawl.
By default, a recursive crawl with wget will go 5 layers deep, i.e. it will download all the links from the first page. Then it will visit each of those links and download their links, and so on, five layers deep.
Setting this value to 1
will only download URLs linked from in the target page. Setting it to 0
is shorthand for an infinite number of layers to crawl. Be careful.
wget -r -l 1 www.stanford.edu
[long list of files downloaded]
FINISHED --2015-06-18 05:26:43--
Total wall clock time: 2.2s
Downloaded: 42 files, 1.0M in 0.4s (2.73 MB/s)
Download only files with a specified extension
Use in conjunction with -r. Extremely helpful if a webpage contains links to a set of binary files you want to collect, without collecting everything else, such as links to other webpages.
wget -r -A .jpg www.stanford.edu
[a wall of output for every jpg on the homepage]
2015-06-18 05:20:34 (4.49 MB/s) - ‘www.stanford.edu/about/history/images/hero-seq.jpg’ saved [520408/520408]
FINISHED --2015-06-18 05:20:34--
Total wall clock time: 3.4s
Downloaded: 32 files, 4.7M in 2.1s (2.26 MB/s)
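When crawling recursively, it's courteous (and less likely to get you blocked) to pause between requests; wget's -w option takes a number of seconds to wait between retrievals
wget -r -w 1 www.stanford.edu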
Mirror an entire site
Again, be careful. From the docs:
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to
-r -N -l inf --no-remove-listing
.
wget -m www.example.com
Snapshot a single page
This is a variation I use when I just want to preserve a single page and all of its visual elements, similar to how sites like archive.is work.
See a full description of the flags and options in this gist.
wget -E -H -k -K -nd -N -p -P /tmp/wikipedia https://en.wikipedia.org/wiki/Main_Page
[wall of output of downloaded files]
FINISHED --2015-06-18 05:48:04--
Total wall clock time: 2.7s
Downloaded: 20 files, 154K in 0.3s (563 KB/s)
Converting /tmp/wikipedia/Main_Page.html... 28-310
Converted 1 files in 0.003 seconds.
Mirror a subdirectory
Use of the --no-parent flag prevents going higher than the specified subdirectory
wget -m -e robots=off --no-parent http://www.example.com/whatsup
Standard usage
whoami
your_sunet_id
Add all the .txt
files in current directory to a zip archive
zip alltext.zip *.txt
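Add a directory and everything inside it with -r
zip -r archive.zip some_dir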