Please read the following instructions before starting the exercises:
- Exercise material setup: download the exercises.zip archive file to your local computer and unzip it. This will unpack a directory named exercises, with all the data needed for the course exercises.
- Exercise solutions: all exercises and Additional Tasks sections have their solutions embedded in this document. The solutions are hidden by default, but you can reveal them by clicking on the drop-down menu, like this one:
  Exercise solution (click me)
  ✨ This reveals the answer ✨
  We encourage you not to look at the solutions too quickly, and to try solving the exercises without them. Remember that you can always ask the course teachers for help.
- Additional Tasks: at the end of each exercise, you will find a section named Additional Tasks. These sections contain tasks to complete if you have time left after completing the main exercise. The Additional Tasks sections will not be corrected in class, but their solutions are given in this document.
Objective: get familiar with navigating the directory tree and listing the content of directories.
- Print your current working directory with the pwd command. This will show you where you currently are in the directory tree.
- Navigate to the exercises/ directory (the one you unpacked from the zip archive file), and then enter the exercise_1 subdirectory.
- Try to run the commands "cd ." and "cd ..". What happens? What do "." and ".." stand for?
- List the content of the exercise_1/ directory with ls, ls -l, ls -lh, and ls -lha.
  - Question: what do the -l, -h and -a options do?
  - 🎯 Hint: you can use man ls to display the help for the ls command. To exit the help, simply type q on your keyboard.
  - ✨ Notes:
    - One-letter options can be grouped together, so ls -lha is the same as ls -l -h -a.
    - Some options have both a "short" and a "long" form. E.g. ls -ah is the short form for ls --all --human-readable.
- List the content of the directory in chronological order (oldest file first) and in reverse chronological order (newest file first).
🔥
Tip: a very handy functionality that the shell provides is the ability to
auto-complete file/directory names. You simply have to start typing the
start of a file/directory name, and then press on TAB on your keyboard. The
shell will autocomplete (as much as possible) the file name. You can try this
functionality to autocomplete the name of the file
a_regular_file_with_a_really_long_name.md
.
Exercise solution
- Printing the current working directory:
    pwd
- Navigate to exercises/exercise_1:
    cd /path/to/directory/exercises
    ls -l
    cd exercise_1
    pwd
    ls -l
  ✨ Changing directory to exercise_1 can of course also be done in a single command:
    cd /path/to/directory/exercises/exercise_1
- The "." symbol is a shortcut for the current directory. So running "cd ." has no effect, since it simply changes to the directory we are already in. The "." shortcut is useful in some situations. E.g. if you want to copy a file to the current directory you can do "cp /file/to/copy .", or you can run an executable located in the current directory with ./run_me.sh.
  The ".." symbol is a shortcut to the parent directory. These shortcuts can be combined, so e.g. cd ../.. will go up two levels in the directory tree.
- Listing the content of the exercise_1/ directory with different ls options. The effect of the different options is described in the comments of the code block.
    ls         # Prints the names of files and directories.
    ls -l      # Lists the content of the directory in "long listing" format. This
               # provides additional details for each file/directory, such as
               # its permissions, its size and its last modified date.
    ls -lh     # Adding the "-h" option displays file sizes in "human readable"
               # format. The sizes of files are shown in kB, MB, GB, instead of
               # their size in bytes (octets).
    ls -lha    # Adding the "-a" option additionally displays hidden files and
               # directories. These are files/directories whose name starts with
               # a dot ".".
               # Hidden files are often used to store program configurations.
- List the content of the directory in chronological and reverse chronological order.
    ls -lht    # The "-t" option sorts by time, newest file first.
    ls -lhtr   # The "-r" option reverses the order of sorting.
Some other useful ls options and shortcuts:
    ls -a      # --all, also show hidden files.
    ls -R      # --recursive, list subdirectories recursively.
    cd .       # Does nothing, we stay in the same directory.
    cd ..      # Go to parent directory.
    cd /       # Go to root directory.
    cd ~       # Go to user's home directory, on Linux: /home/<user name>.
    cd -       # Go back to the previous directory.
    cd         # With no argument, cd brings you back to your home directory.
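To try these shortcuts without touching the course material, the short sketch below builds a throwaway directory tree and walks through it. The directory names (demo, a, b) and the variable names are made up for this illustration.

```bash
# Build a small scratch tree: demo/a/b (hypothetical names, not course data).
scratch=$(mktemp -d)
mkdir -p "$scratch/demo/a/b"

cd "$scratch/demo/a/b"
cd .                          # "." is the current directory: nothing changes.
same_dir=$(basename "$PWD")   # Still "b".

cd ../..                      # Go up two levels: b -> a -> demo.
up_two=$(basename "$PWD")     # Now "demo".

cd - > /dev/null              # "cd -" returns to the previous directory
                              # (it also prints that directory, silenced here).
back=$(basename "$PWD")       # "b" again.

echo "$same_dir $up_two $back"   # Prints: b demo b

# Clean up the scratch tree.
cd /
rm -r "$scratch"
```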
- Try the cd ~ and cd - shortcuts. What do they do?
- Create an alias named ll that runs the following command: ls -lh --group-directories-first --color=auto.
  ✨ Notes:
  - On some Linux systems, an ll alias may already exist.
  - To list your currently defined aliases, type alias to list them all, or alias <name of alias> to list a specific one (e.g. alias ll).
  - Aliases are only valid within your current shell session. To make aliases permanent, they must be defined inside a configuration file, such as ~/.bashrc, so that they get loaded each time a new shell is spawned.
  - To remove an alias we use unalias <alias name>, or we remove it from the config file where it is defined.
- Compute the size of a directory. To display the size of a directory, the command du -sh <directory> can be used. Try it on the directories found in exercise_1.
- Let's look at a detail of how ls displays file sizes.
  Go into the directory a_directory and list its content using the following commands - look at how file size is indicated:
  - ls -l: lists the file size in bytes/octets.
  - ls -lh (you can also use your new ll alias!): lists the file size in a more readable format, using the k, M, G, ... unit abbreviations for kB (kilobyte), MB (megabyte), GB (gigabyte), etc.
  ✨ Note: in everyday language, the term kilobyte (abbreviated kB) is used for talking interchangeably about either 1000 bytes or 1024 bytes, because they represent almost the same quantity of bytes. If we really wanted to be precise, the proper name for a unit of 1024 bytes is a kibibyte (KiB), while a kilobyte designates 1000 bytes. Similarly, a megabyte is 1'000'000 bytes, and a mebibyte is 1024^2 bytes (same with gigabyte vs. gibibyte, terabyte vs. tebibyte, etc.).
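The gap between the decimal and binary units can be checked with shell arithmetic. The sketch below only uses the built-in $(( )) syntax; the variable names are chosen for this illustration.

```bash
# Decimal (SI) units are powers of 1000, binary (IEC) units are powers of 1024.
kB=1000
KiB=1024
MB=$((1000 * 1000))
MiB=$((1024 * 1024))
GB=$((1000 * 1000 * 1000))
GiB=$((1024 * 1024 * 1024))

echo "1 KiB - 1 kB = $((KiB - kB)) bytes"   # 24 bytes
echo "1 MiB - 1 MB = $((MiB - MB)) bytes"   # 48576 bytes
echo "1 GiB - 1 GB = $((GiB - GB)) bytes"   # 73741824 bytes
```

The relative difference grows with each unit (about 2.4 %, 4.9 % and 7.4 %), which is why the distinction matters more for large files. With GNU ls, ls -lh reports sizes in powers of 1024, while ls -l --si uses powers of 1000.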
Additional tasks solution
- cd ~ and cd - shortcuts:
  - ~ is a shortcut for the "home directory", and therefore cd ~ is a shortcut to change directory to your home directory.
  - "cd -" changes to the previous working directory. It is handy if you want to return to a directory you were in just previously.
- Create an ll alias:
    # Create a new "ll" alias:
    alias ll='ls -lh --group-directories-first --color=auto'
  Here are some more useful commands for aliases:
    alias        # Lists the currently defined aliases.
    type ll      # The "type" command tells if a command is an alias, or shows
                 # the path to the binary file.
                 # -> ll is aliased to `ls -lh --group-directories-first --color=auto'
    type bash    # -> bash is /usr/bin/bash
    unalias ll   # Removes the alias from the current shell session.
- Show the size of the directories:
    du -sh a_directory   # 20K (20 kilobytes)
    du -sh b_directory   # 4K (directory is empty, 4K is the size of an empty dir)
    # Using the ? wildcard character, we can also compute the size of both
    # directories in a single command.
    du -sh ?_directory
- Nothing to correct.
Objective: learn to use wildcard characters to match existing file names.
✨ Notes:
- The technical term for the expansion of wildcard characters by the shell is filename expansion, but it is also very often referred to as globbing.
- Globbing only matches existing file/directory names: expansion will not happen if there is no matching file/directory. This is why its official name is filename expansion!
- 🔥 Tip: if you don't want a specific wildcard character to expand, you can escape it by prefixing it with \. E.g. ls test_\*.md will try to list a file named exactly test_*.md.
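The three situations described in these notes (a pattern that matches, a pattern that matches nothing, and an escaped wildcard) can be observed with the small sketch below. It runs in a scratch directory; the file names are made up, and it assumes the default shell settings (options such as nullglob or failglob in bash would change the second case).

```bash
scratch=$(mktemp -d)
cd "$scratch"
touch test_1.md test_2.md       # Hypothetical file names for the demo.

# Case 1: the pattern matches existing files and is replaced by their names.
set -- test_*.md
matched=$#                      # 2 arguments: test_1.md and test_2.md

# Case 2: nothing matches, so the pattern is passed on unchanged,
# as a literal string.
set -- nothing_*.md
unmatched=$1                    # The literal text "nothing_*.md".

# Case 3: escaping the wildcard prevents expansion even though files match.
set -- test_\*.md
escaped=$1                      # The literal text "test_*.md".

echo "$matched | $unmatched | $escaped"

cd /
rm -r "$scratch"
```

Here set -- simply stores the result of the expansion in the positional parameters ($1, $2, ...), which makes it easy to inspect what a command would actually receive.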
To start this exercise, enter the directory exercise_2/RedList_mammals and list its content with the command ls.
You will see that it contains a large number of files, whose names are those of the critically endangered mammal species as listed in the International Union for Conservation of Nature (IUCN) Red List.
The species names are given in binomial nomenclature (i.e. Latin names), and each file name has the structure Genus_species. E.g. if there was a file for humans, it would be named Homo_sapiens.
Using ls and wildcard characters, perform the following tasks:
- List all files starting with the letter i (upper or lower case).
  🎯 Hint: you should have 1 match.
- List the files of Rhinoceros species (genera Rhinoceros, Dicerorhinus, and Diceros).
  🎯 Hint: you should have 3 matches.
- List the files of Gibbon species from the genus Nomascus whose species name ends with either r or i.
  🎯 Hint: you should have 2 matches.
- List the files of species that meet both of the following conditions:
  - The genus name contains the pattern "l + a single letter + a letter between a and h", e.g. lia or lug.
  - The species name starts with a g.
  For instance, Eubalaena glacialis, the North Atlantic right whale, would be a match, because its genus name Eubalaena contains the pattern lae and its species name glacialis starts with a g.
  🎯 Hint: you should have 3 matches.
Exercise solution
- There is only one file that starts with the letter i:
    cd exercise_2/RedList_mammals/
    ls -l I*
    # Returns a single match: Indri_indri (a lemur species)
  ✨ Since all file names start with a capital letter, ls -l I* is sufficient to list all files starting with the letter i. If there were also files starting with lower case letters, we would use ls -l [iI]*
  ⚠️ Please note that ls -l [iI]* and ls -l i* I* are not completely equivalent expressions: ls -l i* I* will return an error unless there are files starting with i as well as files starting with I (you can test it in your terminal).
- The critically endangered Rhino species are:
    ls -l Rhinoceros_* Dicerorhinus_* Diceros_*
    ls -l Rhinoceros* Dicero*      # Gives the same result.
    # Dicerorhinus_sumatrensis (Sumatran Rhinoceros)
    # Diceros_bicornis (Black Rhino)
    # Rhinoceros_sondaicus (Javan Rhinoceros)
  ✨ Since both the genus Dicerorhinus and Diceros start with Dicero, we can match the pattern Dicero* to get both genuses at the same time.
  🦏 There exist 2 other Rhino species:
  - The White Rhino (Ceratotherium simum) is listed as "Near Threatened" by the IUCN. This species has two subspecies: the Northern and Southern White Rhino. The Northern White Rhino subspecies is critically endangered with only 2 female individuals remaining worldwide (living in semi-captivity in Kenya).
  - The Greater One-Horned Rhino (a.k.a. Indian Rhino), Rhinoceros unicornis, is listed as "Vulnerable" by the IUCN.
- Gibbon Nomascus species whose species name ends in r or i:
    ls -l Nomascus_*[ri]
    # Nomascus_concolor (Black crested gibbon).
    # Nomascus_siki (Southern white-cheeked gibbon).
- Species matching both conditions:
    ls -l *l?[a-h]*_g*
    # Eubalaena_glacialis (North Atlantic right whale).
    # Gorilla_gorilla (Western gorilla).
    # Plecturocebus_grovesi (Alta Floresta titi monkey - a new world monkey)
- List the files of species that satisfy both of the following conditions:
  - The genus name contains the pattern "a or o, followed by exactly 2 letters, followed by the letter x" (e.g. abix or onyx).
  - The species name ends either with an i or with the pattern ra.
  For instance, Pteralopex pulchra, the Montane monkey-faced bat, would be a match, because its genus name Pteralopex contains the pattern opex and its species name pulchra ends with the pattern ra.
  🎯 Hint: this cannot be matched in a single expression only with regular file globbing (i.e. filename expansion). You will need to either i) use 2 expressions with regular globbing, ii) use brace expansion, or iii) use pattern matching.
  🎯 Hint: you should have 4 matches.
- Try to add quotes (single or double) around a globbing pattern with wildcards, e.g. ls -l "I*":
  - What difference does it make (if any)?
  - Can you think of a use case for using quotes around a pattern with wildcards?
Additional tasks solution
- File names matching the requested criteria:
    ls -l *[ao]??x_*ra *[ao]??x_*i   # Solution using pure globbing. Requires some duplication.
    ls -l *[ao]??x_*@(ra|i)          # Solution using pattern matching (extended globbing; run
                                     # "shopt -s extglob" first if it is not already enabled).
    ls -l *[ao]??x_*{ra,i}           # Solution using both globbing and brace expansion.
    # Myosorex_eisentrauti (Eisentraut's mouse shrew).
    # Pteralopex_flanneryi (Greater monkey-faced bat).
    # Pteralopex_pulchra (Montane monkey-faced bat).
    # Sorex_sclateri (Sclater's shrew).
  ✨ To avoid duplicating the *[ao]??x_* part, we can use either pattern matching or brace expansion. Note that *[ra] would not be a correct way to match the ending ra: [ra] matches a single character, either r or a.
  - Pattern matching: here @(ra|i) matches either the pattern ra or i.
  - Brace expansion: during the shell's processing, braces {} are expanded first (before globbing), and therefore:
      ls -l *[ao]??x_*{ra,i}
    is expanded into:
      ls -l *[ao]??x_*ra *[ao]??x_*i
    before the actual globbing is performed.
- Adding single or double quotes around the search pattern prevents the shell from performing filename expansion (globbing). Instead, the pattern is passed on literally. E.g. ls -l 'I*' in the example below will try to find a file named I*, instead of any file starting with the letter I.
    ls -l 'I*'
    # ls: cannot access 'I*': No such file or directory
  One use case for quotes is when a pattern must be kept unexpanded, e.g. when we store the pattern to match in a shell variable (for instance in a shell script):
    # We store the pattern "I*" in a variable named "search_pattern".
    search_pattern="I*"
    echo "${search_pattern}"     # -> I* (quoted: the pattern itself is printed).
    # Later we can use our stored pattern to match files:
    ls -l ${search_pattern}      # -> lists all files starting with "I".
  Note that the shell does not perform filename expansion on the value of a variable assignment, so search_pattern=I* stores the very same text. What matters is how the variable is used afterwards: without quotes its value undergoes globbing, with double quotes it does not.
    search_pattern=I*
    echo ${search_pattern}       # -> Indri_indri: the unquoted value is expanded by
                                 #    globbing... not the pattern we stored.
    echo "${search_pattern}"     # -> I*
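The effect of quoting a variable that holds a pattern can be verified with the following sketch. It runs in a scratch directory and reuses two file names from the exercise; the variable names stored and expanded are made up for this illustration.

```bash
scratch=$(mktemp -d)
cd "$scratch"
touch Indri_indri Gorilla_gorilla   # Two file names borrowed from the exercise.

search_pattern=I*                   # Assignment: the value is the literal text I*
                                    # (no globbing happens at this point).

stored="$search_pattern"            # Quoted use: no globbing.
expanded=$(echo $search_pattern)    # Unquoted use: globbing takes place.

echo "stored='$stored' expanded='$expanded'"
# Prints: stored='I*' expanded='Indri_indri'

cd /
rm -r "$scratch"
```

As a rule of thumb, quote the variable whenever the pattern itself is wanted, and leave it unquoted only where expansion into file names is the intended effect.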
Objective: learn to use the mkdir, cp and mv commands.
Enter the directory exercise_3/ and perform the following tasks:
- Create directories with the mkdir command:
  - In the directory exercise_3/, create 2 new sub-directories: species_by_genus and species_by_common_name.
  - In species_by_genus/, create 2 new sub-directories: Dendrolagus (tree-kangaroos) and Crocidura (a genus of shrews).
  - In species_by_common_name/, create 2 new sub-directories named B and R.
- Copy files using the cp command:
  - From the directory exercise_2/RedList_mammals, make a copy of all files of the genuses Dendrolagus and Crocidura into their respective sub-directories in species_by_genus.
  - From the directory exercise_2/RedList_mammals, copy the file for the Red Wolf (Canis rufus) to the directory species_by_common_name.
- Move and rename files with the mv command:
  - Enter the species_by_common_name directory.
  - In this directory, move the file Canis_rufus into subdirectory R.
  - Rename the Canis_rufus file you just moved into the subdirectory R to the common name of the species: Canis_rufus -> Red_wolf.
- Copy and rename files with the cp command:
  - Similarly to what we did for the Red Wolf file, we will now copy and rename the file for the Black Rhinoceros (Diceros bicornis), but this time we will copy and rename the file in a single step using the cp command.
  - Copy the file Diceros_bicornis from its original location (in exercise_2/RedList_mammals) into species_by_common_name/B, while directly renaming it to the common name of the species: Black_rhino.
- Copy, rename and delete directories:
  - Change directory to the root of the exercise_3/ directory.
  - Copy the entire directory species_by_genus/Dendrolagus/ with all its content to the root of exercise_3.
  - Rename the directory to Tree-kangaroos.
  - Delete the directory Tree-kangaroos and its content in a safe way.
Exercise solution
- Create the species_by_genus and species_by_common_name directories.
    cd exercise_3
    # Option 1: create one directory after the other.
    mkdir species_by_genus
    mkdir species_by_common_name
    # Option 2: create both directories with a single command.
    mkdir species_by_genus species_by_common_name
    # Option 3: use brace expansion to avoid repeating the common part
    # of the directory names.
    mkdir species_by_{genus,common_name}
  Create sub-directories Dendrolagus and Crocidura:
    # Option 1: create sub-directories while in the exercise_3 directory.
    mkdir species_by_genus/Dendrolagus species_by_genus/Crocidura
    # Option 2: enter the species_by_genus directory, then create the
    # sub-directories "Dendrolagus" and "Crocidura".
    cd species_by_genus/
    mkdir Dendrolagus Crocidura
    cd ..
    # Option 3: same as option 1, but using brace expansion to avoid repetition.
    mkdir species_by_genus/{Dendrolagus,Crocidura}
  Create sub-directories R and B:
    mkdir species_by_common_name/{R,B}
  ✨ Note: using the -p option of mkdir, it is possible to create multiple levels of directories in a single command. For example, we could create all the directories for this exercise in a single command:
    mkdir -p species_by_{genus/{Dendrolagus,Crocidura},common_name/{R,B}}
  🔥 Tip: if you want to preview the output of a brace expansion by the shell, you can run the command prefixed with echo: it will print the command that would be executed to the terminal, without running it.
    echo mkdir -p species_by_{genus/{Dendrolagus,Crocidura},common_name/{R,B}}
- Copy files for Dendrolagus and Crocidura:
    cp ../exercise_2/RedList_mammals/Dendrolagus_* species_by_genus/Dendrolagus/
    cp ../exercise_2/RedList_mammals/Crocidura_* species_by_genus/Crocidura/
  Copy the file for the Red Wolf:
    cp ../exercise_2/RedList_mammals/Canis_rufus species_by_common_name/
- Move and rename the Red Wolf file:
    cd species_by_common_name/
    mv Canis_rufus R/              # Move the file into its subdirectory.
    mv R/Canis_rufus R/Red_wolf    # Rename the file to the common name of the species.
- Copy and rename the file for the Black Rhino in a single cp command:
    # Note: this assumes you are currently in directory "species_by_common_name".
    cp ../../exercise_2/RedList_mammals/Diceros_bicornis B/Black_rhino
- Copy, rename and delete a directory:
    cd ..                                  # Change directory to `exercise_3`.
    cp -r species_by_genus/Dendrolagus/ .  # Copy the directory and its content.
    mv Dendrolagus/ Tree-kangaroos         # Rename the directory.
  To delete the directory in a safe way, we first delete all the files inside it, and then delete the empty directory with rmdir. Note that rmdir will not delete a directory if it's not empty - this is a safety behavior to avoid deleting a large number of files by mistake.
    rm Tree-kangaroos/*
    rmdir Tree-kangaroos
  🦘 Note: the faster way to delete the directory and all of its content is to use the command: rm -rf Tree-kangaroos.
  ⚠️ This recursively deletes the directory, so make sure to target the correct directory: it is otherwise very easy to delete large amounts of data by mistake, and there is no command to undo a file deletion.
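The copy-and-rename step and the safety behavior of rmdir can be tried on disposable data. The sketch below works in a scratch directory; the directory source and the placeholder file content are stand-ins for the course data, and first_try and second_try are variable names made up for this illustration.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for the course data: one source directory with one file.
mkdir source
echo "placeholder" > source/Canis_rufus

mkdir -p by_common_name/R       # -p also creates missing parent directories.
cp source/Canis_rufus by_common_name/R/Red_wolf   # Copy and rename in one step.

# rmdir refuses to remove a directory that still contains files ...
if rmdir by_common_name/R 2> /dev/null; then
    first_try="removed"
else
    first_try="refused"
fi

# ... so the "safe" sequence empties the directory first.
rm by_common_name/R/*
rmdir by_common_name/R
if [ -d by_common_name/R ]; then
    second_try="still there"
else
    second_try="removed"
fi

echo "$first_try, then $second_try"   # Prints: refused, then removed

cd /
rm -r "$scratch"
```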
- At the root of exercise_3/, create a new directory named species_by_binomial_name and enter it.
- Inside this directory, create sub-directories named A, B, C, ... Z (i.e. one directory for each letter of the alphabet).
  To avoid doing this tedious work manually, you can use a for loop very similar to this example:
    for x in {A..Z}; do echo ${x}; done
  Try to run the above code in your shell (it will only print things to the screen without creating anything on disk). Then adapt the for loop so that it creates the directories for A to Z.
- Using a similar for loop again, copy all files from exercise_2/RedList_mammals into their correct subdirectory, i.e. the subdirectory that corresponds to the first letter of the Genus name. For example: Marmota_vancouverensis should go into sub-directory M because the first letter of the genus name is M.
  Note that when running the for loop, you will get some warning messages, because the genuses present in RedList_mammals do not cover all letters of the alphabet. However, this is not a problem here because it does not prevent the for loop from running to the end.
Additional tasks solution
# Create and enter the new directory.
mkdir species_by_binomial_name
cd species_by_binomial_name/
# Create directories "A" to "Z" with a for loop.
for x in {A..Z}; do mkdir ${x}; done
# Copy the species files into the correct directory.
# Note that letters that do not have any matching genus will print a warning
# to the terminal, but this does not prevent the loop from completing.
for x in {A..Z}; do cp ../../exercise_2/RedList_mammals/${x}* ${x}/; done
ls -l ./*
✨ The task of creating all the directories could also be done using brace expansion, like so:
mkdir {A..Z}
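If the warnings for letters without any matching genus are a nuisance, the loop can check whether the pattern matched anything before calling cp. The following is only one possible sketch, under these assumptions: it requires bash (for the {A..E} brace expansion), it runs in a scratch directory, it is limited to the letters A to E to keep it short, and the directory name source and the variable names are made up.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for RedList_mammals, with three file names taken from the exercises.
mkdir source
touch source/Canis_rufus source/Diceros_bicornis source/Dicerorhinus_sumatrensis

# Preview the brace expansion with echo, then create the directories.
preview=$(echo mkdir {A..E})
mkdir {A..E}

copied=0
for x in {A..E}; do
    # Store the result of the expansion in the positional parameters.
    # If nothing matches, $1 is the unexpanded pattern, which is not a file.
    set -- source/"${x}"*
    if [ -e "$1" ]; then
        cp "$@" "${x}/"
        copied=$((copied + $#))
    fi
done

in_D=$(ls D | wc -l)
echo "$preview"                  # Prints: mkdir A B C D E
echo "copied $copied files"      # Prints: copied 3 files

cd /
rm -r "$scratch"
```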
Objective: learn to use the find command.
Go to the root of the directory exercises/ and perform the following tasks using the find command:
- Find all files with a .png extension located anywhere in exercises/. Make sure to list only files, and not directories.
  🎯 Hint: you should find a total of 6 files.
- Find files with either a .jpeg or .png extension. As before, directories should be excluded from the search results.
  🎯 Hint: you should find a total of 15 files.
- Find all .jpeg files larger than 15 kB.
Exercise solution
- Find all .png files. We use -type f to restrict the search to files (and exclude directories). The pattern is quoted so that it is passed to find as-is, rather than being expanded by the shell.
    cd /path/to/directory/exercises    # Make sure to be in the correct directory.
    find . -type f -name "*.png"
    # -> There are 6 ".png" files, located in ./images/img.png/
- Find all .jpeg or .png files. We use the -or operator to combine both conditions. Note that the -type f must be repeated for each condition, because -or has a lower precedence than the implicit "and" that links consecutive tests.
    find . -type f -name "*.png" -or -type f -name "*.jpeg"
    # -> There are 9 ".jpeg" files, located in ./images/img.jpeg/
    # -> There are 6 ".png" files, located in ./images/img.png/
- Find all .jpeg files larger than 15 kB.
    find . -type f -name "*.jpeg" -size +15k
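The role of -type f, of the quotes, and of the -size test can be explored on a tiny directory tree. The sketch below is an illustration under assumptions: it runs in a scratch directory whose layout only imitates the images/ directory of the course, all file names are made up, and it uses escaped parentheses with -o (the portable spelling of -or) as an alternative to repeating -type f.

```bash
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p images/img.png images/img.jpeg
touch images/img.png/a.png images/img.jpeg/small.jpeg
# One 20 KiB file, to have something above the 15k threshold.
dd if=/dev/zero of=images/img.jpeg/big.jpeg bs=1024 count=20 2> /dev/null

# The directory named img.png also matches "*.png", hence the need for -type f.
with_dirs=$(find . -name "*.png" | wc -l)            # a.png + directory img.png
files_only=$(find . -type f -name "*.png" | wc -l)   # a.png only

# Parentheses group the two name tests, so -type f is written only once.
png_or_jpeg=$(find . -type f \( -name "*.png" -o -name "*.jpeg" \) | wc -l)

# Only the 20 KiB file is larger than 15k.
large_jpeg=$(find . -type f -name "*.jpeg" -size +15k)

echo "$with_dirs $files_only $png_or_jpeg $large_jpeg"

cd /
rm -r "$scratch"
```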
Try to use the -exec option of find to display full details of the files found in point 3 of the main exercise above.
One possible way to use the -exec option is as follows: -exec <command> "{}" +, where <command> is the custom command to execute and "{}" is expanded to the files found by find.
The expected output for all .jpeg files larger than 15 kB looks like this:
-rw-rw-r-- 1 bob bob 25K Jan 8 11:16 ./images/img.jpeg/linux_logo.jpeg
-rw-rw-r-- 1 bob bob 26K Jan 8 11:40 ./images/img.jpeg/linux_gentoo_logo.jpeg
-rw-rw-r-- 1 bob bob 16K Jan 8 11:36 ./images/img.jpeg/linux_suse_logo.jpeg
Additional tasks solution
find . -type f -name "*.jpeg" -size +15k -exec ls -lh "{}" +
✨ The following syntax is also possible:
find . -type f -name "*.jpeg" -size +15k -exec ls -lh "{}" \;
In this case, find will execute an individual ls -lh command for each file that was found, rather than passing all files to a single ls -lh call. In the case of the ls command, this makes no practical difference, but there are commands that accept only a single file as argument, in which case the solution with "{}" + would not work because all files are passed to the command in a single call.
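The difference between the two endings can be made visible by running a command that reports how many arguments it received. In the sketch below, the small sh -c script and the three file names are made up for the illustration; inside the script, $# is the number of file names handed over by find.

```bash
scratch=$(mktemp -d)
cd "$scratch"
touch a.txt b.txt c.txt

# "+" : find appends all the file names it found to a single command call.
batched=$(find . -type f -name "*.txt" -exec sh -c 'echo "$# file(s)"' sh "{}" +)

# "\;": find runs the command once per file.
one_by_one=$(find . -type f -name "*.txt" -exec sh -c 'echo "$# file(s)"' sh "{}" \;)
calls=$(echo "$one_by_one" | wc -l)

echo "$batched"       # Prints: 3 file(s)
echo "$one_by_one"    # Prints "1 file(s)" three times, one line per call.

cd /
rm -r "$scratch"
```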
Objective: get familiar with shell commands that display text file content: head, tail, cat and less.
From the root of the exercises/ directory, locate the file protein_sequences.fasta using the find command and navigate to it. Then perform the following tasks:
- Display the start/end of the file using the head and tail commands:
  - Display the first 10 lines of the file.
  - Display the last 5 lines of the file.
- Count the number of lines in the file using the wc command:
  - Count only the number of lines in the file.
  - Count only the number of words in the file.
- Display the content of the file using the cat command:
  - Why is this not the most suitable program here?
  - What is another use of cat?
- Display, navigate and search the file with less:
  - Open the file using less.
  - Add line numbers to the display using the -N option.
  - Navigate the file using the space bar and arrows.
  - Search for the pattern isoform using the command /<search term>, then navigate through the matches with the keys n and N.
  - Close the file with q.
Exercise solution
- Locate the file protein_sequences.fasta and navigate to it.
    find . -name "protein_sequences.fasta"
    cd ./exercise_5/fasta_files/
- Display the first 10 and last 5 lines.
    head protein_sequences.fasta        # No need to specify -n 10 in this case
                                        # because 10 is the default value.
    tail -n 5 protein_sequences.fasta
  🔥 Tips:
  - If you want to display the entire file except for the last X lines you can use head -n -X (replace X by the number of lines you want to skip at the end of the file).
  - Conversely, tail -n +X prints the file starting from line X (i.e. it skips the first X-1 lines) until the end of the file.
- Count the number of lines and words in the file.
    wc -l protein_sequences.fasta    # 19222 lines.
    wc -w protein_sequences.fasta    # 51914 words.
- Display the content of the file with cat. As you can see, this is not an ideal solution for this file because it is large.
    cat protein_sequences.fasta
  One usage of cat is to concatenate 2 or more files together (this is where the command got its name from). cat concatenates files by pasting their content one after another. Here is an example:
    # Create 2 files to concatenate:
    head -n5 protein_sequences.fasta > file_1
    tail -n5 protein_sequences.fasta > file_2
    # Concatenate our 2 files into a file named "file_3".
    cat file_1 file_2 > file_3
    cat file_[12] > file_3    # Same as above, but using filename globbing. We use
                              # [12] rather than ? so that file_3 is not matched.
  ✨ Bonus: we could also create file_3 without using any intermediate file. The <( ) syntax is called process substitution and allows the output of a command to be treated as an input file.
    cat <( head -n5 protein_sequences.fasta ) <( tail -n5 protein_sequences.fasta ) > file_3
  ✨ To concatenate multiple files by columns, use the paste command.
- Display the content of the file with less. Remember that to exit less, you must press the q key on your keyboard.
    less protein_sequences.fasta
    less -N protein_sequences.fasta    # Line numbers can also be added/removed after
                                       # a file was opened, by typing "-N" + "enter".
Display only line 100 of the file protein_sequences.fasta by using a combination of head and tail.
For this you will need to use the | (pipe) operator, which redirects the output of one command into another command.
Additional tasks solution
head and tail can be combined to display any section of a file. Here we print line 100 of the file:
head -n100 protein_sequences.fasta | tail -n1 # Print the 100th line.
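The same combination can be checked on a file whose content is predictable. In this sketch, numbers.txt is a made-up file in which line N contains the number N, so the result is easy to verify; the second command shows how the pair extracts a range of lines rather than a single one.

```bash
scratch=$(mktemp -d)
seq 1 200 > "$scratch/numbers.txt"     # Line N contains the number N.

# Keep the first 100 lines, then keep only the last of those.
line_100=$(head -n 100 "$scratch/numbers.txt" | tail -n 1)

# Keep the first 100 lines, then the last 3 of those: lines 98 to 100.
lines_98_to_100=$(head -n 100 "$scratch/numbers.txt" | tail -n 3 | tr '\n' ' ')

echo "$line_100"           # Prints: 100
echo "$lines_98_to_100"    # Prints: 98 99 100

rm -r "$scratch"
```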
Context: in this exercise, we will be working with 3 files containing replicates from a micro-array experiment. They are located in exercise_6:
micro-array/data/array_data-1.csv
micro-array/data/array_data-2.csv
micro-array/data/array_data-3.csv
To get an overview of the structure of these files, you can have a look at the start of each file with head array_data-* (you need to be in the directory exercise_6/micro-array/data/ to run this command).
Each file contains 2 columns, separated by ; characters:
- Column 1: gene name (ProbeID).
- Column 2: level of gene expression in each replicate (Sample1, Sample2, Sample3).
==> array_data-1.csv <==
ProbeID;Sample1
1007_s_at;10.93
1053_at;8.28
117_at;3.31
==> array_data-2.csv <==
ProbeID;Sample2
1552390_a_at;2.76
1552389_at;3.4
1552388_at;2.61
==> array_data-3.csv <==
ProbeID;Sample3
1007_s_at;11.19
1053_at;8.06
117_at;3.13
Our objective is to merge these three files into a single file with 4 columns: ProbeID (gene name), Sample1, Sample2 and Sample3 (the gene expression levels recorded in each replicate of the experiment).
Additionally, the output file should contain tab-delimited values rather than semicolon-separated values as is the case in the input files.
Our final output should thus look something like this:
ProbeID Sample1 Sample2 Sample3
117_at 3.31 3.41 3.13
121_at 4.42 4.32 4.46
1007_s_at 10.93 11.44 11.19
1053_at 8.28 7.54 8.06
...
Please note that one complication comes from the fact that the order of the genes in the input files is not the same across all files... Therefore the values from the input files will have to be sorted before the columns can be merged.
As this exercise is a bit more complicated, we will decompose it into several steps.
Enter the directory exercise_6/micro-array/data and have a look at the 3 input files using less or head, to get familiar with their structure and content.
The original files contain tabulated data separated by a ; (semicolon) character. We would like to change this separator to a tab \t.
- Convert the delimiters from ; to \t (tab) using the command tr.
- Save the converted content into a new file with a .tsv extension. E.g. the converted version of array_data-1.csv should be named array_data-1.tsv.
Step 2 solution
tr ";" "\t" < array_data-1.csv > array_data-1.tsv
tr ";" "\t" < array_data-2.csv > array_data-2.tsv
tr ";" "\t" < array_data-3.csv > array_data-3.tsv
head *.tsv
# Loop shortcut:
for i in $(seq 1 3); do tr ';' '\t' < array_data-${i}.csv > array_data-${i}.tsv; done
🐧 Fun fact: tr is one of the (few) commands that do not accept a file as input: it only reads from stdin. This is why we must use tr < file or cat file | tr to pass input to tr.
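The conversion can be tested on a few lines before applying it to the real files. In the sketch below, sample.csv is a made-up file that reuses the first rows of array_data-1.csv shown earlier; the two checks at the end are just one possible way to confirm that the conversion worked.

```bash
scratch=$(mktemp -d)
printf 'ProbeID;Sample1\n1007_s_at;10.93\n1053_at;8.28\n' > "$scratch/sample.csv"

# tr reads from stdin only, so the file is passed with the "<" redirection.
tr ';' '\t' < "$scratch/sample.csv" > "$scratch/sample.tsv"

# Check 1: no semicolon is left in the converted file.
remaining=$(tr -cd ';' < "$scratch/sample.tsv" | wc -c)

# Check 2: cut splits on tabs by default, so it can now extract column 2.
last_value=$(cut -f 2 "$scratch/sample.tsv" | tail -n 1)

echo "semicolons left: $remaining, last value: $last_value"

rm -r "$scratch"
```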
Before we can merge the content of the 3 array_data-*.tsv files, we have to make sure that the order of rows in each file (i.e. the order of genes) is the same.
If you look at the files, you will see that this is not the case. Therefore the files need to be sorted by their 1st column (ProbeID).
- Sort each array_data-*.tsv file by its ProbeID column using the sort command.
- Make sure that the header line of the file remains at the top after the sorting operation.
  🎯 Hint: the header line is the only line that starts with a letter, so we can take advantage of numerical sorting: a line that does not start with a number is treated as 0, and is therefore sorted before the lines that start with a positive number. You can have a look at man sort (the help for the sort command) to find the right option.
- Save the sorted output to files named array_data-sorted-*.tsv.
Step 3 solution
sort -n array_data-1.tsv > array_data-sorted-1.tsv
sort -n array_data-2.tsv > array_data-sorted-2.tsv
sort -n array_data-3.tsv > array_data-sorted-3.tsv
head *sorted-?.tsv
# Loop shortcut:
for i in $(seq 1 3); do sort -n array_data-${i}.tsv > array_data-sorted-${i}.tsv; done
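Relying on sort -n to keep the header at the top works for these files because every ProbeID starts with a positive number. As a hedged alternative for data where this might not hold, the header can be set aside explicitly. The sketch below compares both variants on a made-up file in.tsv that reuses a few values from the exercise.

```bash
scratch=$(mktemp -d)
printf 'ProbeID\tSample2\n1053_at\t7.54\n117_at\t3.41\n1007_s_at\t11.44\n' > "$scratch/in.tsv"

# Variant 1: rely on -n treating the non-numeric header line as 0.
sorted_n=$(LC_ALL=C sort -n "$scratch/in.tsv" | cut -f 1 | tr '\n' ' ')

# Variant 2: print the header first, then sort only the data lines.
sorted_split=$( { head -n 1 "$scratch/in.tsv"
                  tail -n +2 "$scratch/in.tsv" | LC_ALL=C sort -n
                } | cut -f 1 | tr '\n' ' ')

echo "$sorted_n"        # Prints: ProbeID 117_at 1007_s_at 1053_at
echo "$sorted_split"    # Same order.

rm -r "$scratch"
```

Setting LC_ALL=C is an extra precaution added in this sketch so that the sort order does not depend on the locale of the machine.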
In this step, we will simply double-check that the sorted files we created at the previous step (array_data-sorted-*.tsv) have the same number of lines and that the genes on those lines are in the same order.
To verify this, the idea is to extract the first column (the gene names) from the sorted files, and compare it across the 3 sorted files.
Proceed as follows:
- Use the cut command to extract the first column (ProbeID) of each array_data-sorted-*.tsv file. Save the output to temporary files named tmp-1.tsv, tmp-2.tsv, and tmp-3.tsv (these files should contain a single column).
- Compare the content of the 3 tmp-*.tsv files using the commands diff -s tmp-1.tsv tmp-2.tsv and diff -s tmp-1.tsv tmp-3.tsv.
- If all tmp-*.tsv files are the same (and they should be), you can now delete these files. Their only purpose was to allow us to check that all files have their lines sorted in the same order.
Step 4 solution
# Extract the first column of each file into a temporary file.
cut -f1 array_data-sorted-1.tsv > tmp-1.tsv
cut -f1 array_data-sorted-2.tsv > tmp-2.tsv
cut -f1 array_data-sorted-3.tsv > tmp-3.tsv
# Loop shortcut:
for i in $(seq 1 3); do cut -f1 array_data-sorted-${i}.tsv > tmp-${i}.tsv; done
# Compare files that contain only the first column:
diff -s tmp-1.tsv tmp-2.tsv
diff -s tmp-1.tsv tmp-3.tsv
rm tmp-?.tsv
✨ Bonus: if we want to avoid creating temporary files, we can use a functionality of the shell called process substitution, where the output of a command is treated as a sort of virtual file.
diff -s <(cut -f1 array_data-sorted-1.tsv) <(cut -f1 array_data-sorted-2.tsv)
diff -s <(cut -f1 array_data-sorted-1.tsv) <(cut -f1 array_data-sorted-3.tsv)
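If you want to act on the result of the comparison (for instance in a script), the exit status of diff can be used. The sketch below runs on made-up single-column files (ids-1.tsv, ids-2.tsv and ids-3.tsv are hypothetical names, not files of the exercise) and relies on the usual diff convention: status 0 when the files are identical, status 1 when they differ.

```shell
# Throwaway directory with three toy single-column files (hypothetical names).
demo_dir=$(mktemp -d)
printf 'ProbeID\n117_at\n121_at\n' > "$demo_dir/ids-1.tsv"
printf 'ProbeID\n117_at\n121_at\n' > "$demo_dir/ids-2.tsv"
printf 'ProbeID\n121_at\n117_at\n' > "$demo_dir/ids-3.tsv"

# Identical files: diff -s prints a confirmation message and exits with 0.
same_status=0
diff -s "$demo_dir/ids-1.tsv" "$demo_dir/ids-2.tsv" || same_status=$?

# Same rows in a different order: diff prints the differences and exits with 1.
diff_status=0
diff -s "$demo_dir/ids-1.tsv" "$demo_dir/ids-3.tsv" || diff_status=$?

echo "identical pair: ${same_status}, different pair: ${diff_status}"
```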
In this final step, we concatenate the content of the array_data-sorted-*.tsv files column-wise to produce a final file (final.tsv) with 4 columns: ProbeID, Sample1, Sample2, and Sample3.
- Use paste to concatenate the content of the -sorted-*.tsv files.
- Use cut to remove the duplicated ProbeID columns.
- These two steps can be performed in a single command line by using a pipe operator (|) to pass the output of paste to cut.
Your final output file should look like this:
ProbeID Sample1 Sample2 Sample3
117_at 3.31 3.41 3.13
121_at 4.42 4.32 4.46
1007_s_at 10.93 11.44 11.19
1053_at 8.28 7.54 8.06
1255_g_at 1.8 1.7 1.75
...
Step 5 solution
paste array_data-sorted-*.tsv | cut -f 1,2,4,6 > final.tsv
head final.tsv
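To make the column numbers used with cut easier to follow, here is a small sketch on toy files. The files toy-sorted-1.tsv to toy-sorted-3.tsv and toy-final.tsv are hypothetical and only contain the first two rows of the expected output shown above.

```shell
# Throwaway directory with three toy 2-column files (hypothetical names).
demo_dir=$(mktemp -d)
printf 'ProbeID\tSample1\n117_at\t3.31\n121_at\t4.42\n' > "$demo_dir/toy-sorted-1.tsv"
printf 'ProbeID\tSample2\n117_at\t3.41\n121_at\t4.32\n' > "$demo_dir/toy-sorted-2.tsv"
printf 'ProbeID\tSample3\n117_at\t3.13\n121_at\t4.46\n' > "$demo_dir/toy-sorted-3.tsv"

# paste alone produces 6 columns: ProbeID is repeated in columns 1, 3 and 5.
paste "$demo_dir"/toy-sorted-?.tsv | head -n 1

# Keeping columns 1, 2, 4 and 6 drops the two repeated ProbeID columns.
paste "$demo_dir"/toy-sorted-?.tsv | cut -f 1,2,4,6 > "$demo_dir/toy-final.tsv"
cat "$demo_dir/toy-final.tsv"
```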
✨ Note: to learn how to do the entire processing we did in this exercise in a single command, look at the Additional Tasks 6 section below. It also shows how the processing can be done using the join command, but it's actually more complicated in this case.
Try to re-write the entire process as a single command. For this you will need to use the following shell functionalities:
- The pipe operator |, to pass the output of one command as input to the next.
- The process substitution syntax <( command ), to treat the output of a command (or of a series of commands) as a virtual file from which data is read.
This task is a bit more complicated, so to get you started, here is an example to give you some inspiration. You can run the command below in your shell:
paste <(sort -n array_data-1.csv | cut -f2 --delim=";") <(sort -n array_data-2.csv | cut -f2 --delim=";") | head
Additional tasks solution
Here is how we could do the whole processing of this exercise in a single command without creating any intermediate files.
paste <(sort -n array_data-1.csv | tr ";" "\t") \
<(sort -n array_data-2.csv | cut -f2 --delim=";") \
<(sort -n array_data-3.csv | cut -f2 --delim=";") > final_2.tsv
# Compare the new output file with the one we produced earlier:
diff -s final*.tsv
diff -s final{,_2}.tsv
Another alternative is to use the join command. Unfortunately, the join command can only join 2 files at a time, so in this case the solution ends up being more complicated (join would however be much better in cases where we want to join files that have missing rows, i.e. where not all files have all the rows):
join --check-order -o 0 1.2 1.3 2.2 -1 1 -2 1 -t ";" \
<( join --check-order -o 0 1.2 2.2 -1 1 -2 1 -t ";" \
<( sort array_data-1.csv ) \
<( sort array_data-2.csv ) \
) \
<( sort array_data-3.csv ) | sort -n | tr ";" "\t" > final_3.tsv
# Compare the new output file with the one we produced earlier:
diff -s final{,_3}.tsv
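To illustrate the missing-rows case mentioned above, here is a minimal sketch using two toy files. The files toy-1.csv and toy-2.csv are hypothetical, and the situation where each file lacks one probe is an assumption made for the example (it does not occur in the exercise data). The NA placeholder is also an arbitrary choice.

```shell
# Throwaway directory with two toy ";"-delimited files (hypothetical names).
# Probe 1007_s_at is missing from the first file, 117_at from the second.
demo_dir=$(mktemp -d)
printf '117_at;3.31\n121_at;4.42\n' > "$demo_dir/toy-1.csv"
printf '121_at;4.32\n1007_s_at;11.44\n' > "$demo_dir/toy-2.csv"

# join expects both inputs to be sorted on the join field with the same
# collation rules, so the C locale is forced for sort and join alike.
LC_ALL=C sort -t ';' -k 1,1 "$demo_dir/toy-1.csv" > "$demo_dir/toy-1.sorted.csv"
LC_ALL=C sort -t ';' -k 1,1 "$demo_dir/toy-2.csv" > "$demo_dir/toy-2.sorted.csv"

# -a 1 -a 2     : also print rows that are present in only one of the files.
# -e NA         : print NA in place of a missing value.
# -o 0,1.2,2.2  : print the join field, then column 2 of each file.
LC_ALL=C join -t ';' -a 1 -a 2 -e NA -o 0,1.2,2.2 \
    "$demo_dir/toy-1.sorted.csv" "$demo_dir/toy-2.sorted.csv" > "$demo_dir/toy-joined.csv"
cat "$demo_dir/toy-joined.csv"
```

With paste, the same two files would simply be glued line by line, silently putting values of different probes on the same row.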
In this exercise, we will work with a copy of the file exercise_5/fasta_files/protein_sequences.fasta. This file is a so-called FASTA file. FASTA is a text-based format to represent nucleotide or protein sequences.
- FASTA files can contain one or more sequences.
- Each new sequence starts with a sequence header line, which starts with the character >. A sequence header is always on a single line.
- Each sequence header is followed by one or more lines that contain the nucleotide or amino acid sequence itself.
Here is an example of a section of a FASTA file:
>sp|P18823|ACCD_PEA Carboxyl transferase OS=Pisum sativum GN=accD PE=1 SV=3
MINEDPSSLTDMDNNIDSWKNNSENSSYSHADSLADVSNIDNLLSDKIFSIRDSNSNIYD
IYYAYDTNDTNITKYKWTNNINRCIESYLRSQICEDIDFNSDICDKVQRTIIILIRSTND
NDISDTNDISDTNDTNDTNAIYDPFDISDTNDTN
>sp|P09339|ACON_BACSU Aconitate hydratase OS=Bacillus subtilis GN=citB PE=1 SV=4
MANEQKTAAKDVFQARKTFTTNGKTYHYYSLKALEDSGIGKVSKLPYSIKVLLESVLRQV
DGFVIKKEHVENLAKWGTAELKDIDVPFKPSRVILQDFTGVPAVVDL
Enter the directory exercise_7 and start by making a copy of the file exercise_5/fasta_files/protein_sequences.fasta in that directory. Name the copy of the file sequences.fasta.
✨ If you are on Linux or macOS, you may also create a symlink instead of copying the file:
ln -s ../exercise_5/fasta_files/protein_sequences.fasta sequences.fasta
A symlink is a pointer to a file: creating one does not make an actual copy of the file. Symlinks are not supported on Windows (except if using WSL and working on a non-Windows partition).
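If you want to see what a symlink looks like before using one in the exercise, here is a small sketch to run on Linux or macOS. The files original.fasta and link.fasta are hypothetical toy files created in a temporary directory.

```shell
# Throwaway directory with a tiny made-up FASTA file.
demo_dir=$(mktemp -d)
printf '>toy_sequence made-up header\nMINEDPSS\n' > "$demo_dir/original.fasta"

# Create the symlink. A relative target is resolved from the directory in
# which the link itself is located.
ln -s original.fasta "$demo_dir/link.fasta"

# ls -l shows the link together with the file it points to.
ls -l "$demo_dir"

# Reading through the link returns the content of the original file.
cat "$demo_dir/link.fasta"
```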
Have a look at the sequences.fasta file (e.g. using the less command).
Then answer the following questions using the grep command:
- How many sequences are there in the file? 🎯 Hint: count the number of header lines in the file.
- How many entries are from Staphylococcus?
- Display header lines that are not from Staphylococcus.
Here is a reminder of some of the grep options:
- -i: case insensitive search.
- -c: suppress normal output; instead print the count of matching lines.
- -o: print only matching content, not the entire line.
- -n: add the line number in front of printed output.
- -v: inverted search - print lines that do not match the pattern.
Part A solution
cd exercise_7/
cp ../exercise_5/fasta_files/protein_sequences.fasta sequences.fasta
# Count the number of sequences in the file:
grep -c "^>" sequences.fasta # -> 3325 sequences.
# Count the number of sequences from Staphylococcus
# Note the use of the `-i` option of `grep` ("case insensitive search").
grep -ci "os=staphylococcus " sequences.fasta # 141 sequences.
# Display the header lines of sequences that are not from Staphylococcus.
grep "^>" sequences.fasta | grep -vi "os=staphylococcus "
grep "^>" sequences.fasta | grep -vi "os=staphylococcus " | wc -l # The sequence count is 3184.
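The same grep options can be tried on a file small enough to check by eye. In the sketch below, toy.fasta is a hypothetical file: its first two headers are taken from the FASTA example shown earlier, while the third entry (TOY_STAAU) and all shortened sequences are made up for illustration.

```shell
# Throwaway directory with a 3-sequence toy FASTA file.
demo_dir=$(mktemp -d)
cat > "$demo_dir/toy.fasta" <<'EOF'
>sp|P18823|ACCD_PEA Carboxyl transferase OS=Pisum sativum GN=accD PE=1 SV=3
MINEDPSSLTDMDNNIDSWKNN
>sp|P09339|ACON_BACSU Aconitate hydratase OS=Bacillus subtilis GN=citB PE=1 SV=4
MANEQKTAAKDVFQARKTFTTN
>toy|X00000|TOY_STAAU Made-up protein OS=Staphylococcus aureus GN=toy PE=1 SV=1
MKTAYIAKQR
EOF

# -c counts matching lines: here, the header lines.
total=$(grep -c '^>' "$demo_dir/toy.fasta")

# -i makes the search case insensitive.
staph=$(grep -ci 'os=staphylococcus ' "$demo_dir/toy.fasta")

# -v inverts the match: headers that are NOT from Staphylococcus.
others=$(grep '^>' "$demo_dir/toy.fasta" | grep -vci 'os=staphylococcus ')

echo "total=${total} staphylococcus=${staph} others=${others}"
```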
In the second part of this exercise, your task is to display the 10 most frequent genera in the sequences of the sequences.fasta file, along with their frequency (i.e. the number of sequences for each of the 10 most frequent genera in the file).
Here is a suggested way to perform this task:
- Isolate the header lines.
- Isolate the genus name from each line. To do this, you can take advantage of the controlled vocabulary in the file: the organism's name is always prefixed with OS=.
- Sort the genus names, compute their frequency and keep only a single instance of each genus name.
- Sort the genera by frequency and keep only the 10 most frequent.
🎯 Hints:
- The steps above are best done as part of a pipeline: use the | (pipe) operator to pipe the output of one command into the next.
- When building the pipeline and doing tests, you can end your pipeline with | head so that you avoid printing the whole file each time.
🎯 Additional hints:
Here are some commands and their options that are useful for this exercise:
- uniq -c: the -c/--count option prefixes each line with the number of occurrences.
- sort -nr: -n/--numeric-sort sorts numerically instead of alphabetically. -r/--reverse sorts in decreasing order.
- grep -o: the -o/--only-matching option returns only the matching part of a line instead of the entire line (the default grep behavior).
Part B solution
There are multiple ways to perform this task; here are a few possibilities.
Here are 3 different variants using a pipeline built around grep and cut.
- ✨ Note: some pipelines make use of the grep option -o, which instructs grep to output only the text that matches the pattern instead of the entire line on which the match is found.
grep "^>" sequences.fasta | cut -f2 --delim="=" | cut -f1 --delim=" " | sort | uniq -c | sort -nr | head
grep -o "OS=[a-zA-Z]*" sequences.fasta | cut -f2 --delim="=" | sort | uniq -c | sort -nr | head
# Same as above, but using the "[[:alpha:]]" syntax to indicate we only want
# to match alphabetic letters and not e.g. spaces (or numbers).
grep -o "OS=[[:alpha:]]*" sequences.fasta | cut -f2 --delim="=" | sort | uniq -c | sort -nr | head
# Output of the pipe: the 10 most frequent genera and their frequency in the file.
168 Arabidopsis
166 Escherichia
163 Bacillus
152 Homo
141 Staphylococcus
134 Mus
111 Oryza
84 Salmonella
83 Rattus
72 Mycobacterium
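The pipelines above all sort the genus names before calling uniq -c. The sketch below, which uses a made-up list of five genus names, shows why: uniq only merges identical lines that are adjacent.

```shell
# Toy list of genus names (made up), deliberately not sorted.
genus_list=$(printf 'Homo\nMus\nHomo\nMus\nHomo\n')

# Without sort, no two identical lines are adjacent, so uniq -c reports
# five separate entries, each with a count of 1.
unsorted_counts=$(printf '%s\n' "$genus_list" | uniq -c)
printf '%s\n' "$unsorted_counts"

# With sort, identical names become adjacent and are counted together.
# The final sort -nr puts the most frequent genus first.
sorted_counts=$(printf '%s\n' "$genus_list" | sort | uniq -c | sort -nr)
printf '%s\n' "$sorted_counts"
```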
Here is another solution that uses a more complex regular expression to isolate the genus name directly. For this we must:
- Use grep with "Perl"-style regular expressions by adding the -P option (an option of GNU grep).
- Use a lookbehind match: (?<=OS=) requires the matched text to be immediately preceded by OS=, without including OS= in the output.
grep -oP "(?<=OS=)[a-zA-Z]+ " sequences.fasta | sort | uniq -c | sort -nr | head
🙈 Regular expressions are a powerful tool to do sophisticated pattern matching. However they are beyond the scope of this course.
This is not an easy one, but it's the last!
Our objective is to write a short for loop that copies each species file found in the exercise_2/RedList_mammals directory into the correct directory for its genus, inside a species_by_genus directory.
So basically, instead of only doing it manually for 2 genera as we did in exercise 3, we want to have it done automatically for all genera.
🎯 Hints: this task is more difficult and uses a few concepts that were not presented in the course, such as:
- for loops to repeat a number of instructions multiple times while iterating over a range of values. In our case, we want to iterate over the list of genera.
- Variables: in bash, variables can be:
  - Created using variable_name=value.
  - Accessed using ${variable_name}.
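As a warm-up, here is a minimal sketch of these two concepts on made-up values. The variable names suffix, name and result, as well as the three words, are arbitrary choices for illustration and are unrelated to the exercise files.

```shell
# Create a variable (there must be no space around the "=" sign).
suffix=_sorted

# $( ... ) runs a command and inserts its output in place. The for loop then
# runs its body once per word of that output, storing the word in "name".
result=""
for name in $(printf 'alpha\nbeta\ngamma\n'); do
    echo "Current value: ${name}${suffix}"
    result="${result}${name}${suffix} "   # Append to an existing variable.
done

echo "All values: ${result}"
```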
Here is a scaffold of one possible solution to get you started.
# Enter the "exercise_7" directory and create a new "species_by_genus"
# directory.
cd exercise_7/
mkdir species_by_genus
# Save the "RedList_mammals" directory location in a variable, so it will be
# easy to access later.
red_list_dir=../exercise_2/RedList_mammals
# Loop through all genus values and copy the files for each in the correct
# sub-directory of "species_by_genus".
for genus in $( <pipeline that returns the list of genus> ); do
mkdir species_by_genus/${genus} # Create directory for genus.
cp ${red_list_dir}/... ... # Copy files for genus.
done
# List all the copied files to see if the result is correct.
ls species_by_genus/*
What you have left to do in the code above is to:
- Replace <pipeline that returns the list of genus> with a series of commands that will produce the list of unique genera present in RedList_mammals.
- Replace cp ... with the proper command to copy all files for a given genus.
Additional tasks solution
# Enter the "exercise_7" directory and create a new "species_by_genus"
# directory.
cd exercise_7/
mkdir species_by_genus
# Save the "RedList_mammals" directory location in a variable, so it will be
# easy to access later.
red_list_dir=../exercise_2/RedList_mammals
# Loop through all genus values and copy the files for each in the correct
# sub-directory of "species_by_genus".
for genus in $( ls ${red_list_dir} | cut -f1 --delim="_" | sort | uniq ); do
mkdir species_by_genus/${genus} # Create dir for genus.
cp ${red_list_dir}/${genus}_* species_by_genus/${genus} # Copy files for genus.
done
# List all the copied files to see if the result is correct.
ls species_by_genus/*
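If you would like to test the loop without touching the exercise directories, the sketch below runs a variant of it in a temporary sandbox. The three file names (Panthera_leo.txt, Panthera_tigris.txt, Ursus_arctos.txt) are hypothetical and only assume that file names start with the genus followed by an underscore, as the solution above does. The quoted variables, sort -u (in place of sort | uniq) and mkdir -p are small robustness additions and are not required by the exercise.

```shell
# Build a sandbox that mimics the layout used in the exercise.
sandbox=$(mktemp -d)
mkdir "$sandbox/RedList_mammals" "$sandbox/species_by_genus"
touch "$sandbox/RedList_mammals/Panthera_leo.txt" \
      "$sandbox/RedList_mammals/Panthera_tigris.txt" \
      "$sandbox/RedList_mammals/Ursus_arctos.txt"

red_list_dir="$sandbox/RedList_mammals"
out_dir="$sandbox/species_by_genus"

# Loop over the unique genus names, i.e. the part of each file name that
# comes before the first underscore.
for genus in $(ls "$red_list_dir" | cut -d '_' -f 1 | sort -u); do
    mkdir -p "$out_dir/$genus"                        # No error if it exists.
    cp "$red_list_dir/${genus}"_* "$out_dir/$genus/"  # Copy files for genus.
done

# List the copied files to check the result.
ls "$out_dir"/*
```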