Tutorial

Global I_r

To test ir, download the example genome of Mycoplasma genitalium into the newly created directory Ir_xxx and uncompress it:

gunzip l43967.fasta.gz

To calculate the I_r for this sequence, enter

./ir -i l43967.fasta

to get

# Len   I_r
580076  0.1388

We can compare this to the I_r value of the shuffled version of the genome by typing

shuffleseq -filter l43967.fasta | ./ir

where shuffleseq is part of the EMBOSS software package [3]. The I_r of the shuffled sequence should be close to zero.
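
If EMBOSS is not available, the shuffling step can be approximated with awk. The following is only a sketch: the function name shuffle_fasta is hypothetical and not part of ir or EMBOSS, it assumes a FASTA file containing a single sequence, and it makes no claim to reproduce the algorithm used by shuffleseq. It merely permutes the residues, so the base composition is preserved while the order is randomized.

```shell
# Hypothetical stand-in for shuffleseq (an assumption, not part of ir or EMBOSS).
# Reads a single-sequence FASTA file from standard input and writes a FASTA
# file whose residues are a random permutation of the input residues.
shuffle_fasta() {
    awk '
        /^>/ { if (header == "") header = $0; next }   # keep the first header
        { gsub(/[ \t\r]/, ""); seq = seq $0 }           # collect the residues
        END {
            srand()                                     # seed from the clock
            n = length(seq)
            for (i = 1; i <= n; i++) c[i] = substr(seq, i, 1)
            # Fisher-Yates shuffle over the residue array
            for (i = n; i > 1; i--) {
                j = int(rand() * i) + 1
                t = c[i]; c[i] = c[j]; c[j] = t
            }
            if (header != "") print header
            # print the shuffled residues, 60 per line
            for (i = 1; i <= n; i++) {
                printf "%s", c[i]
                if (i % 60 == 0 || i == n) printf "\n"
            }
        }
    '
}

# Possible usage, assuming ir has been built in the current directory:
#   shuffle_fasta < l43967.fasta | ./ir
```

Because srand() is seeded from the clock, two runs within the same second may produce the same permutation.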

Window Analysis

Instead of computing a single I_r for an entire sequence, it is often more informative to look at local variations in I_r. This can be done by carrying out a window analysis using the -w option. To visualize the result of such a window analysis, we use the program graph, which is part of the plotutils package. Given a functioning installation of plotutils, try

./ir -w 1000 < l43967.fasta \
| graph -y 0  -X Ir -Y Position -T X

where \ indicates line continuation. Notice the very sharp peaks in the resulting plot corresponding to regions with exceptional repetitiveness. We can smooth the plot by increasing the window size to, say, 10,000:

./ir -w 10000 < l43967.fasta \
| graph -y 0  -X Ir -Y Position -T X
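
To locate the most repetitive windows numerically rather than by eye, the window output can be sorted by its I_r column. The sketch below rests on an assumption about the output layout that is not confirmed here: one line per window with the position in the first column and I_r in the second, and header lines beginning with # or >. The helper name top_windows is hypothetical.

```shell
# Hypothetical helper: print the N windows with the highest I_r.
# Assumes two whitespace-separated columns (position, I_r) on standard
# input and skips lines starting with "#" or ">" (assumed header lines).
top_windows() {
    n=${1:-5}
    awk '!/^[#>]/ && NF >= 2' \
        | LC_ALL=C sort -k2,2nr \
        | head -n "$n"
}

# Possible usage, assuming ir has been built in the current directory:
#   ./ir -w 1000 < l43967.fasta | top_windows 10
```

If the columns turn out to be ordered differently, only the sort key (-k2,2nr) needs to change.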

In order to print our plot, generate it first in xfig format, for example:

./ir -w 10000 < l43967.fasta \
| graph -y 0  -X Ir -Y Position -T fig > mg.fig

The resulting file can now be edited using the program xfig. Alternatively, we can convert it to PostScript by typing

fig2dev -L ps mg.fig > mg.ps

Similarly, encapsulated PostScript for inclusion in, e.g., LaTeX documents can be generated by

fig2dev -L eps mg.fig > mg.eps

Multiple Sequences

Multiple sequences can be analyzed either separately or combined. To see the difference between these two modes, compute the I_r of the random sequence supplied together with the source files:

cat ranSeq.fasta | ./ir
# Len   I_r
10000   0.0012

If we add another exact copy of this sequence to the analysis, we expect to see a large increase in I_r:

cat ranSeq.fasta ranSeq.fasta | ./ir
# Len   I_r
20000   6.3702

However, if we switch on the separate mode of analysis, we return to the low I_r value of the original isolated sequence:

cat ranSeq.fasta ranSeq.fasta | ./ir -s
# Seq   Len     I_r
>Random Sequence #1; G/C=0.50   10000   0.0012
>Random Sequence #1; G/C=0.50   10000   0.0012
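
The Len column can be cross-checked independently of ir by counting the residues of each FASTA record. The following sketch uses a hypothetical helper, fasta_lengths, which is not part of ir; it prints each header line followed by the length of its sequence.

```shell
# Hypothetical helper: print "header<TAB>length" for every record of a
# FASTA stream read from standard input.
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) printf "%s\t%d\n", name, len   # finish previous record
            name = $0; len = 0; seen = 1
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # count residues only
        END { if (seen) printf "%s\t%d\n", name, len }
    '
}

# Possible usage with the doubled input from the example above:
#   cat ranSeq.fasta ranSeq.fasta | fasta_lengths
```

For the doubled input, one would expect two records of 10000 residues each, matching the Len values reported by the separate mode.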

When carrying out a window analysis on multiple sequences it might be useful to write the I_r values of the different sequences into distinct files. This can be done using the UNIX tool csplit. For our biologically rather meaningless example we can execute

cat ranSeq.fasta ranSeq.fasta | ./ir -w 100 | csplit -f dataset -z - /\>/ {1}

to find the I_r values for the first sequence in dataset01 and those for the second sequence in dataset02.
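
The behavior of this csplit call can be tried without ir on a small synthetic input. The layout used below (one # header line, then a > line in front of each block of position and I_r pairs) is an assumption made for this sketch, as are the names sample_output and workdir and the values themselves.

```shell
# Synthetic stand-in for the output of "./ir -w 100" on two sequences.
# The layout and the numbers are assumptions for illustration only.
sample_output() {
    printf '%s\n' '# Pos I_r' \
        '>seq1' '50 0.01' '150 0.02' \
        '>seq2' '50 0.03' '150 0.04'
}

# Split in a scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)
( cd "$workdir" && sample_output | csplit -s -f dataset -z - '/>/' '{1}' )

# List the pieces that csplit created.
ls "$workdir"
```

Under this assumed layout, the header line ends up in dataset00, which would be consistent with the values for the two sequences being found in dataset01 and dataset02.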
