nohup : running a program in the background

The importance of the nohup command

The nohup command is especially useful if you’re running programs on a remote computer. Without nohup, any connection failure (broken pipe) between your computer and the remote computer will stop the program you are running.

And believe me, that will happen. If the program has been running for days (or weeks), you will have to re-run it and you will have lost all that time.

It can be extremely frustrating, so… use nohup.

nohup makes your program ignore the hangup signal, which is sent when the connection breaks or when you exit the terminal before the program has finished, and which would normally stop your program.

When you use nohup, the output of the program you are running which would normally appear on the terminal will go to a file called nohup.out in your current directory (if you haven’t specified another file name).

How to use nohup?

Just use the program the way you normally do, adding “nohup” at the beginning and the “&” symbol at the end:

$ nohup program-to-run &

If your program prints its output on the terminal, redirect that output to a file:

$ nohup program-to-run > outputfile.txt &

If your program has two outputs (standard output and error output) you can redirect both outputs to files of your choice:

$ nohup program-to-run > outputfile.txt 2> program-messages.txt &

outputfile.txt receives the standard output of the program (stdout).
program-messages.txt receives the standard error output (stderr), which contains errors but sometimes just messages from the program.
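
If you launch your programs from a Python script rather than from the command line, the same idea can be sketched as below. This is only a rough equivalent, assuming Python 3 on a Unix-like system; the names run_like_nohup and ignore_hangup are mine, and the little program being launched is made up for the example.

```python
import os
import signal
import subprocess
import sys
import tempfile


def ignore_hangup():
    """Runs in the child process just before the program starts."""
    signal.signal(signal.SIGHUP, signal.SIG_IGN)


def run_like_nohup(command, out_path, err_path):
    """Start command with the hangup signal ignored, sending standard
    output and standard error to two separate files."""
    with open(out_path, "w") as out, open(err_path, "w") as err:
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=out,                # like  > out_path
            stderr=err,                # like 2> err_path
            preexec_fn=ignore_hangup,  # Unix-like systems only
        )


if __name__ == "__main__":
    # Hypothetical program: writes one line to each output.
    program = [sys.executable, "-c",
               "import sys; print('result'); print('message', file=sys.stderr)"]
    with tempfile.TemporaryDirectory() as folder:
        out_path = os.path.join(folder, "outputfile.txt")
        err_path = os.path.join(folder, "program-messages.txt")
        run_like_nohup(program, out_path, err_path).wait()
        with open(out_path) as out:
            print("stdout file:", out.read().strip())
        with open(err_path) as err:
            print("stderr file:", err.read().strip())
```

Popen returns immediately, a bit like the “&” on the command line; the call to wait() is only there so that the example can read the two files afterwards.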

Accessing a computer remotely

What is accessing a computer remotely?

Accessing a computer remotely means logging in to a computer you are not physically sitting at (the “remote computer”) from another computer (the ‘source’ computer) and using its resources (such as memory and CPU).

When and why is it interesting to access a computer remotely?

Using a remote computer can be extremely useful if you need to conduct experiments which
1- require a computer more powerful than the one you actually have
2- need their output data to be kept on another computer

What is necessary to access a remote computer?

– The fixed IP address of the remote computer

Usually computers do not have a fixed IP. However, powerful computers that stay constantly on are more likely to have a fixed IP.

– Authorization to use ssh (Secure Shell)

Basically, this means that when you try to enter the remote computer through its ssh port (usually port 22), a program (a daemon) is there listening for whatever comes through that port. If ssh is not enabled on the remote computer, nothing listens on the ssh port and the messages you send to the remote computer will be ignored. (I’m no pro so I may be oversimplifying.)
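
To make the idea of a port being ‘listened to’ a bit more concrete, here is a small sketch (assuming Python 3; the function name port_is_open is mine) that checks whether anything accepts connections on a given port. It only tells you that some program is listening, not that you are authorized to log in.

```python
import socket


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a program accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: nobody is listening for us.
        return False


if __name__ == "__main__":
    # Is an ssh daemon listening on this very computer?
    print(port_is_open("localhost"))
```

To test a remote computer, you would replace "localhost" with its IP address.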

How to access a remote computer?

1- Getting the IP address of the remote computer (in Unix-like OS)

Open a terminal (on the computer you want to access remotely)
Type ifconfig and press Enter

The IP address will be in the line starting with ‘inet addr’ (Ubuntu; ‘inet adr’ on a French-language system) or ‘inet’ (MacOS).

Examples:

inet addr:123.4.0.1 -> 123.4.0.1 is the current IP (Ubuntu)
inet 123.123.234.10 -> 123.123.234.10 is the IP (Mac OS)

Warning: A computer with a fixed IP can also show a varying IP. Make sure to take the fixed one (for me, it has been the first one I encountered). If, after a while, you can no longer access the remote computer, you may have taken the varying IP address instead of the fixed one.

2- Accessing the remote computer (from your ‘source’ computer)

Open a terminal and follow the following steps :

$ ssh user_name@IP-address-of-remote-computer

Example:

$ ssh sophie@123.123.234.10

The authenticity of host '123.123.234.10 (123.123.234.10)' can't be established.
RSA key fingerprint is c1:c1:02:0a:32:06:f1:05:56:0f:58:01:21:02:0b:47.
Are you sure you want to continue connecting (yes/no)? yes   -> type yes
Warning: Permanently added '123.123.234.10' (RSA) to the list of known hosts.
Password:                      -> type your password for the remote computer

You’re done! You now have, on your computer, the terminal of the remote computer. All the programs you launch from this terminal will run on the remote computer.

If an error appears, you may need to give the remote computer’s user permission to log in remotely.

3- Optional step: Creating a pair of keys (public and private)

This step is good to follow to secure the remote access. I did it using a MacOS computer so it may be different for other OS. On Ubuntu, it shouldn’t be too different.
For Windows users, seriously consider installing a Unix-like operating system such as Ubuntu.

Open a terminal

$ cd ~/.ssh
$ ssh-keygen -t rsa (MacOS)  #  $ ssh-keygen (Ubuntu)

Generating public/private rsa key pair.
Enter file in which to save the key (/Users/Sophie/.ssh/id_rsa): <authorized-keys>

Enter passphrase (empty for no passphrase): <ENTER>
Enter same passphrase again: <ENTER>
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is: long:string:of:stuff Sophie@computer

The key's randomart image is:
<An image will appear here>

And then?

Then you’re done. From now on, when you want to access the remote computer, type in the terminal of your ‘source’ computer: $ ssh User@fixed-IP. You will be asked for your password (which will not appear as you type it) and you will be connected to the remote computer.

Last remark: There may be a delay between your typing on the ‘remote computer’s terminal’ and the appearance of the text on the screen. Do not get frustrated about it and organize your command line beforehand.

Fun and brisk Python learning

Last week I started reading Think Python – How to Think Like a Computer Scientist by Allen Downey.

It is a fun, fast-paced book for learning Python from the beginning. No need to know the computer jargon, everything is explained in a simple and effective way.

I haven’t read it all but the first chapters are quite enjoyable and that definitely helps the learning process. At the end of each chapter, there are exercises which will help you program, debug (find errors in a program) and learn the syntax and vocabulary of Python.

You can download the PDF of the book using the following link: http://www.greenteapress.com/thinkpython/thinkpython.pdf 

Hope you’ll enjoy it as much as I do!

Most useful bash commands

In this post I will list the most useful commands I’ve come across so far.

Note about options

Commands and other programs can have options that are very useful in some cases. An option changes the ‘way’ the command executes what you ask of it. Options are like flags you put after the command / program you want to run. They usually start with ‘-’ or ‘--’.

For example:
‘-v’ is an option that can mean ‘verbose’.
‘-h’ usually stands for ‘help’

The commands I will present have options but I will not go through their options.

Not sure how to use a command, or which options it accepts?

The commands man (manual) and info (information) document it all neatly, and their pages are available through the terminal as follows:

man 'command'
info 'command'

The manual has a manual page:
man man

You can get information about info:
info info

Navigating in folders (and files)

cd /path/to/directory # Change Directory
ls # List directory content
pwd # Prints the complete path of the current/working directory

Create, move, remove folders and/or files

mv file1.txt file2.txt # rename file1.txt into file2.txt (will overwrite file2.txt if it exists, check the -i option)
mv file1.txt new/directory # move file1.txt to a new directory

cp file1.txt file2.txt # makes a copy of file1.txt called file2.txt
cp file1.txt new/directory # Copy file1.txt to a new directory

mkdir directory_name # Make directory

rmdir directory/to/remove # remove directory (only empty ones)
rm file.txt # Remove the file.txt (CAREFUL with this one. There is no coming back)

Getting information on files

head input.txt # Get the first 10 lines of input.txt
tail input.txt # Get the last 10 lines of input.txt

grep 'pattern' input.txt # Outputs all the lines of input.txt which contain ‘pattern’

Some options for grep are very useful:
-o will output only the part of the line of the input file corresponding to the pattern
-v will invert the match (output all the lines which do not have the pattern)
-c will count the number of lines in which the pattern appears

Regular Expressions can be used in the pattern. See the wikipedia page for further information; http://en.wikipedia.org/wiki/Regular_expression. Also I hope to post about them soon.
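
To see what these three options do, here is a rough Python imitation (a sketch, assuming Python 3; the function names and the example lines are made up). Python’s regular expressions are not exactly the same as grep’s, so take it as an illustration of the idea, not as a replacement for grep.

```python
import re


def grep(pattern, lines, invert=False):
    """Lines containing the pattern (or not containing it if
    invert=True, like the -v option)."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


def grep_count(pattern, lines):
    """Number of lines in which the pattern appears, like -c."""
    return len(grep(pattern, lines))


def grep_only_matching(pattern, lines):
    """Only the parts of the lines matching the pattern, like -o."""
    regex = re.compile(pattern)
    return [match.group(0) for line in lines for match in regex.finditer(line)]


lines = ["gene1 ATGCCC", "gene2 TTTAAA", "contig7 ATGAAA"]
print(grep("ATG", lines))                       # ['gene1 ATGCCC', 'contig7 ATGAAA']
print(grep("ATG", lines, invert=True))          # ['gene2 TTTAAA']
print(grep_count("^gene", lines))               # 2
print(grep_only_matching("gene[0-9]+", lines))  # ['gene1', 'gene2']
```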

diff file1.txt file2.txt # Outputs the differences between file1.txt and file2.txt

wc file.txt # Prints on the terminal the number of lines, words and bytes of file.txt (can be used with more than one file)

Manipulating files

cat file1.txt file2.txt # Concatenate file1.txt and file2.txt. The output will be the lines of file1.txt followed by the lines of file2.txt

cut # remove sections from each line of files
cut -d ',' -f 2,7 file.txt # Keep columns 2 and 7 from file.txt which has its columns delimited by commas (-f option stands for field; -d option stands for delimiter of the fields). This is usually used when you want to keep only certain columns of a table

paste # Merge lines of files.
paste file1.txt file2.txt # The first line of file2.txt will be put next to (on the same line as) the first line of file1.txt in the output. This is usually used if you want to create a table where some of the columns are on one file and others on another. Just remember that the lines must correspond to each other (line1 with line 1 of the second file, line 2 with line 2, etc)

join file1.txt file2.txt # join lines of two files on a common field

sed 's/thing-to-replace/replacement-text/' file.txt  # Stream Editor, the ‘s’ means you are doing a substitution of the ‘thing to replace’ by the ‘replacement-text’. You can also use Regular Expressions in the text to replace.
sed '1,2 d' file.txt # Delete lines 1 and 2 from file.txt

tr  # translate or delete characters.
tr ',' '\t' < file.txt  # Replace the commas by tabs in file.txt
WARNING FOR MAC USERS: ‘\t’, which stands for ‘tab’ in Ubuntu, may not work in every command on Mac OS. In that case, to enter the ‘tab’, press [Control] + [V], followed by [Tab]
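
If it helps to see what cut, paste, sed and tr are doing to each line, here is a rough imitation in Python (a sketch, assuming Python 3; the function names and the tiny table are mine). The real commands handle many more cases.

```python
import re


def cut(lines, delimiter, fields):
    """Keep the given fields (numbered from 1) of each line,
    like cut -d delimiter -f fields."""
    kept = []
    for line in lines:
        columns = line.split(delimiter)
        kept.append(delimiter.join(
            columns[i - 1] for i in fields if i <= len(columns)))
    return kept


def paste(lines_a, lines_b, separator="\t"):
    """Put line n of lines_b next to line n of lines_a, like paste.
    Unlike the real paste, this stops at the end of the shorter input."""
    return [a + separator + b for a, b in zip(lines_a, lines_b)]


def substitute_first(pattern, replacement, lines):
    """Replace the first match on each line, like sed 's/pattern/replacement/'."""
    return [re.sub(pattern, replacement, line, count=1) for line in lines]


def translate(lines, old, new):
    """Replace every occurrence of one character by another, like tr."""
    return [line.replace(old, new) for line in lines]


table = ["a,b,c", "d,e,f"]
print(cut(table, ",", [1, 3]))                  # ['a,c', 'd,f']
print(paste(["a", "d"], ["1", "2"]))            # ['a\t1', 'd\t2']
print(substitute_first("a", "X", ["banana"]))   # ['bXnana']
print(translate(table, ",", "\t"))              # ['a\tb\tc', 'd\te\tf']
```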

Text Editors

You can open a text editor by just typing its name on the terminal.
nano or pico # Small Text Editors that open in the terminal. When you open them, you will see the keys you must use to save, exit, etc. The ‘^’ symbol means ‘control’.
emacs
vim
gedit # for ubuntu users

Others

bc # Basic calculator. In ‘bc’ you will be able to do some calculations.
history # lists the recent command lines you’ve used

Ok, that’s it for now. If you know these commands and how to use them, you can do a GREAT deal of things.

The Terminal : Don’t panic

This (usually) black (or white) plain window can seem a bit unfriendly at first. I myself thought it was reserved for some genius computer ninjas who, knowing how to use it, had great magical powers.

I was right.

You do acquire great magical powers when you start using the terminal. I’m not sure about the ninja part but it does feel like it.

And I was also completely wrong.

Using the terminal is not reserved for some elite computer geniuses. It’s practical, efficient and you won’t get lost in lists of menus. It’s to the point. Actually, if you want to analyse large data sets (like RNA-seq data), the terminal will make your life a whole lot easier.

What usually scares people is that it doesn’t have a very developed graphical interface, so there are no buttons to click on. But no worries, it is not harder to use than any other program. It’s just different.

So what is the terminal?

The terminal is a great program which gives you the means to run commands (bash programs) or other programs (written in Python, Java, Perl, C…).

What can you do with the terminal?

  • Install Programs, compile them… (the most boring part)
  • Use the command line tools
  • Run programs or scripts

In the next post I will give more substantial information, I promise.

One last thing : The colors and the font of the terminal can be changed… You don’t have to stick with the defaults. I like having the terminal in a completely different color than the other windows so that I can spot it easily on the screen.

Example of a road map to analyse RNA-seq data

My master’s project was about finding the genes that could possibly be responsible for (or involved in) the resistance phenotype of a flesh-eating fly, the New World screwworm.

A little bit of context

The New World screwworm is a flesh-eating species which causes many problems. The insecticides (organophosphates) used to control it are losing efficiency as the populations become resistant.

My objective

My mission was to find the genes Differentially Expressed (DE) between a control group and a resistant group using an RNA-seq pipeline. In a pipeline, the output of one step is the input of the next step.

In this section

In the RNA-seq section of this blog, I will explain post by post the steps I undertook to reach my objective. Our road map is the following:

  • Process the raw reads (trimming the regions of bad quality, sorting the reads, evaluating the quality)
  • Collapse the processed reads
  • Assemble the reads de novo into contigs (reconstruction of the original transcripts as best as possible)

In the case of my species, little genetic information was available and its genome had not been sequenced, so the reads had to be assembled into contigs ‘de novo’, i.e. without the help of a reference genome.

  • Annotate the contigs by local blast
  • Align the processed reads of each condition (control and resistant) and replicate against the assembled contigs
  • Estimate the level of expression of each contig in each condition and replicate
  • Test which contigs are DE between the conditions
  • Annotate by remote blast the DE contigs
  • Do a Gene Ontology (GO) analysis to see if the over-expressed and under-expressed contigs (DE contigs) are enriched in some gene characteristics when compared to all the contigs assembled
  • Find Variants (SNPs and InDels) between both conditions (maybe)

Lots of posts ahead of us!

Some Ideas of Challenges

Programming is about creating. It’s about making the computer do an automated task. Writing scripts is the best way to memorize the syntax of the language and improve your programming skills, such as knowing how to break down a problem into parts, debugging faster or knowing which tools you’ll need to solve your problem.

Here is a list of little challenges I was given to get the hang of programming in Python (try doing them by defining functions):

  • Order a list of numbers : [2,0,8,-1,6] -> [-1,0,2,6,8]
  • Define a function which returns the divisors of a number
  • Given an interval (2 numbers), print (or return) a list of all the perfect numbers of that interval.

A is a perfect number if the sum of its proper divisors (all its divisors except A itself) is equal to A.
6 and 28 are perfect numbers.

  •  Given an interval, print (or return) a list of all the amicable numbers (or “friend” numbers) found in that interval.

A and B are amicable (friends) if the sum of the proper divisors of A (all its divisors except A itself) is equal to B and the sum of the proper divisors of B is equal to A.
220 and 284 are amicable numbers
Remark : Perfect numbers are friends with themselves

  • Given an interval, print (or return) a list of all the prime numbers found in that interval.
  • Draw a sudoku: Given a sudoku in the form of a string (=sequence of characters) of 81 characters, draw it on the computer
  • Know if a character is being repeated in a string
  • Reverse a string: ATGC -> CGTA
  • Get the reverse complement of a sequence: ATGC -> GCAT
  • Is your string a palindrome? Return True if it is, False otherwise.

A palindrome is a string which stays the same if you reverse it.
‘racecar’, ‘noon’, ‘hannah’, ‘radar’ and ‘ovo’ are examples of palindromes
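
If the definitions above are not clear, here is a small sketch (assuming Python 3; the function names are mine) which only checks the examples given in the list. It takes “divisors” to mean all the divisors of a number except the number itself. The challenges themselves (working on an interval, returning lists) are still yours to solve!

```python
def sum_of_proper_divisors(n):
    """Sum of the divisors of n, n itself excluded."""
    return sum(d for d in range(1, n) if n % d == 0)


def reverse_complement(sequence):
    """Reverse the sequence and swap A with T and G with C."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(sequence))


print(sum_of_proper_divisors(6))     # 6   (1 + 2 + 3)
print(sum_of_proper_divisors(28))    # 28  (1 + 2 + 4 + 7 + 14)
print(sum_of_proper_divisors(220))   # 284
print(sum_of_proper_divisors(284))   # 220
print(reverse_complement("ATGC"))    # GCAT
```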

And, of course, you can invent your own challenges! I advise you to start with small challenges that won’t take too long to solve; otherwise you risk getting overwhelmed, frustrated or upset, and then you may want to give up on programming. And you definitely don’t want that to happen! Programming gets easier and easier as you learn more about it and create new programs. Not to mention, it becomes fun!

Defining your own functions in Python

When facing a programming challenge, you can solve it by creating a function or several functions which work together.

Creating functions makes your script :

  • easier to review
  • clearer for other people who may want to understand your algorithm (= the steps undertaken to solve the problem)
  • faster to debug (easier to find the mistakes)

Using functions also forces you to break down the challenges into smaller parts that are easier to code and manage. Another advantage: you can reuse them in other scripts if need be.

The keyword ‘def’ enables you to create your own functions, which you can then call from another function (already implemented in Python or also created by you).
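
As a minimal illustration (a sketch, assuming Python 3; the two functions are invented for the example), here is one function defined with def and then called from inside a second function:

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


def describe(sequence):
    """Build a short description, calling gc_content() to do so."""
    return "%s: %d%% GC" % (sequence, round(100 * gc_content(sequence)))


print(describe("ATGC"))   # ATGC: 50% GC
```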

Here is an example of a script which uses many functions to “collapse” a fastq file. Collapsing consists in finding identical sequences in the fastq file and retaining only one copy. Since, in the fastq format, a “sequence” has its information in 4 lines, when a duplicate sequence is found, we have to discard the 4 lines corresponding to that “sequence”


################## Beginning of the script collapse-fastq.py ##################
# Authors Sophie and Tássio Naia Tandonnet
# Jan 2014
#
# This is free software, see GNU public license version 3 or later.
#

'''
Collapses a fastq file, removing duplicates of sequences.

A fastq file has the following format:

@seq3
atttccgetttt
+
''Ax.)-Z.s]]!
@seq1
atttccgetttt
+
''Ay.AxSZ.s]]!

that is, groups of four lines (block), the first line is the
seq-id, the second is the sequence, the third is the qual-id and the fourth, the qualities
'''

# Strategy
# --------
# for each block of the fastq file
# if the sequence has not been seen
# output block

# At this point, we start defining our data structures. We
# agree to store the block in a 4-tuple.

def next_block(fastq_file):
    '''Reads next four lines of fastq_file
    and returns a tuple with the lines.'''

    l1 = fastq_file.readline()
    l2 = fastq_file.readline()
    l3 = fastq_file.readline()
    l4 = fastq_file.readline()
    return (l1,l2,l3,l4)

# How do we recall which sequences have been seen?
# We store the seen sequences in a list.

def has_been_seen(block, seen_sequences):
    '''Returns True if the second string of the block tuple is
    in the list, and False otherwise.'''
    return block[1] in seen_sequences   

# last line equivalent to:
# if block[1] in seen_sequences:
#     return True

def output_block(b):
    '''Prints the strings in the 4-tuple b'''
    print(''.join(b),end='')

def add_sequence(block,seen_sequences):
    '''Add the second string of the block-tuple to the list
    seen_sequences.'''
    seen_sequences += [block[1]]

def valid_block(b):
    '''Returns True if b[0] is not an empty string, returns False
    otherwise.'''
    return b[0] != ''     # equivalent to:
                          # if  b[0] != '':
                          #     return True

# Here we go! Are we missing sth?

import sys   # The module sys allows us put the input file (fastq file 
             # in this case) on the command line when we call the script 
             # instead of having to alter the script each time we want 
             # to run it on a different file.

filename = sys.argv[-1]    # the input file is the last argument (-1) of the 
                           # command line you will write to run the script

fastq_file = open(filename) # Opening input file

seen_sequences = []   # Creating a list
block = next_block(fastq_file)

while valid_block(block):
    if not has_been_seen(block, seen_sequences):
        output_block(block)
        add_sequence(block, seen_sequences)
    block = next_block(fastq_file)

fastq_file.close()    # closing the input file

#################### End of the script collapse-fastq.py ####################

To run the script (supposing it is saved in a text file as collapse-fastq.py), you would type the following in the terminal. The script uses the Python 3 form of print, so it has to be run with Python 3:


$ python3 /path/to/collapse-fastq.py path/to/fastq-file/to/collapse/file.fq
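
A remark on speed: looking something up in a list gets slower as the list grows, which may matter for a fastq file with millions of sequences. Below is a sketch of a variant (not the script above; it assumes Python 3, and the names read_blocks and collapse as well as the small example data are mine) which remembers the seen sequences in a set instead of a list.

```python
import io


def read_blocks(fastq_file):
    """Yield the fastq file four lines at a time, as 4-tuples."""
    while True:
        block = tuple(fastq_file.readline() for _ in range(4))
        if block[0] == "":      # empty string: end of the file
            return
        yield block


def collapse(fastq_file):
    """Return the blocks whose sequence (second line) was not seen before."""
    seen_sequences = set()      # a set instead of a list
    kept = []
    for block in read_blocks(fastq_file):
        if block[1] not in seen_sequences:
            seen_sequences.add(block[1])
            kept.append(block)
    return kept


# A made-up fastq file held in memory: seq1 repeats the sequence of seq3.
example = io.StringIO(
    "@seq3\nATTTCC\n+\nIIIIII\n"
    "@seq1\nATTTCC\n+\nHHHHHH\n"
    "@seq2\nGGGTTT\n+\nIIIIII\n"
)
for block in collapse(example):
    print("".join(block), end="")
```

Only the blocks of seq3 and seq2 are printed, since seq1 has the same sequence as seq3.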

REMARK: In the beginning it is not easy to write a script mostly in the form of defining functions. You have to think about which function would be good to create and even start writing your code calling functions which do not yet exist…

In the previous example, we started by writing:


seen_sequences = []   # Creating a list
block = next_block(fastq_file)

while valid_block(block):
    if not has_been_seen(block, seen_sequences):
        output_block(block)
        add_sequence(block, seen_sequences)
    block = next_block(fastq_file)

without having defined ANY functions

It can make you uneasy but it’s a habit worth working on.

Writing and running a Python script on the terminal

1- Creating a script

Before running a script on the terminal you need to have one. Scripts are written in a text editor such as:

  • Gedit (Ubuntu)
  • Text Edit (Mac)
  • Emacs
  • Vim
  • etc…

Word and Writer (OpenOffice/LibreOffice) are not to be used.

Example of a script (you can copy it to your favorite text editor):


# Script to draw a rectangle

big_side = 10
small_side = 5

print((big_side)*'. ' + '.')

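while small_side > 1:
    # the lines inside the loop must be indented;
    # '  ' is two spaces, so that the sides line up with the top
    print('. ' + '  '*(big_side - 1) + '.')
    small_side = small_side - 1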

print((big_side)*'. ' + '.')

2- Saving the file

The text file containing the script is then saved using the extension ‘.py’ (meaning the file is written in Python).

3- Running the script on the terminal

On the terminal you will type :

$ python /path/to/script/python-script.py

REMARKS

– The $ symbol is not to be typed, it is just there to indicate the command line

– If you were writing in another programming language the logic would be the same. In Perl, you would give the extension ‘.pl’ to your script text file and run it on the terminal by typing ‘ $ perl /path/to/script/perl-script.pl ‘

Python versions

My default Python version is 2.7.3. To find out your Python version, just type ‘python’ in the terminal and press Enter.

I also have Python3.2 installed and to use it I just type:

$ python3.2                                    # to use python on the terminal

$ python3.2 python-script.py          # to run a script

REMARK : The “#” symbol allows you to make comments in a python script. Everything after the “#” is ignored when you execute the program. On this blog I use the symbol merely to make comments on what I write.

There are some syntactic differences between Python2 and Python3. If you can install and use a Python3 version, it is probably better since it is gradually replacing the Python2 versions.
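
To give an idea of these differences, here is a small sketch written for Python 3 (the variable names are mine); the comments say what Python 2 does instead. These are just two well-known examples, not a complete list.

```python
# Written for Python 3; the comments say what Python 2 does instead.

# 1. print is a function: the parentheses are required and options
#    such as end= are available. In Python 2, print is a statement
#    (print "hello"), so the end= line below is a syntax error there.
print("hello", end="")
print(" world")

# 2. Dividing two integers with / keeps the decimal part.
#    In Python 2, 7 / 2 gives 3.
half = 7 / 2
print(half)      # 3.5

# 3. // is the integer division in both versions.
whole = 7 // 2
print(whole)     # 3
```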

A great way to start learning Python from scratch

So you want to learn Python…

A good way to get started is to go on the Computer Science Circles of the University of Waterloo (http://cscircles.cemc.uwaterloo.ca/). There you will be guided through concepts and you will be able to solve little problems directly on the site.

You can also save your progress if you register (but there is no need to register to use the site)

The really good thing is that you get to test and experiment. And this is probably the best way to learn programming: putting the new concepts into practice.