MCALIGN: alignment of noncoding DNA sequences based on explicit models of indel evolution

Download MCALIGN version 1.05 / MCALIGN2 version 1.02 - executables for Linux, Mac OS X and Windows; program source files.

Reliable alignment of noncoding DNA is difficult due to the unknown pattern of insertion/deletion (indel) events, as illustrated by the following two alternative alignments between a pair of sequences:

Alignment-1

ATA----CAG
AGCTAAGCCG

Alignment-2

A--TA--CAG
AGCTAAGCCG

The number of nucleotide differences (shown in red) differs radically between the alternative alignments, but without other information it is impossible to say which is the more plausible. MCALIGN tackles this problem by aligning sequences using an explicit model of indel evolution, and finds the most probable alignment for a pair of sequences, or three sequences of know phylogenetic relationship. The indel evolution model has two parameters: (1) theta, the frequency indels relative to the frequency of nucleotide substitution; (2) w, a vector of relative frequencies of indels of different length. Models for some taxa are supplied, or a new model can be set up by the user. We have developed two related methods that generate two-or three-way alignments. Please cite Keightley and Johnson (2004) or Wang et al. (2006) if you publish work that uses MCALIGN.

Sequence input file format

The sequence input file format is FASTA, as follows:

>Species 1
GTAAGGAAATCGTACTTTTAAATGGACTTAGTTAAATCATTACTTTTATAG
>Species 2
GTAAGAAAATGCGCACTTATAGTTGAACTTAACACATAACTTTTATAG

Any gaps, denoted by '-', are removed from the sequences after they are read in.

Predefined models of sequence evolution

An input file, indelfreq.preconfig.dat, specifies models of indel evolution. Currently, models for Drosophila, rodent and hominid noncoding DNA are provided. For alignment of noncoding DNA of other taxa a model needs to be supplied by the user (see below).

Running MCALIGN2

On Linux and Mac OS X, open a terminal, and change to the directory containing the downloaded files. Assume that $ is the system prompt and the executable file is called mcalign2. Give the executable execute permission by issuing the command:

                   $ chmod +x mcalign2

To run the program issue the command

$ ./mcalign2

On Windows, it is best to open a window with the command (MS DOS) prompt in the folder containing the program, and to run the program from the command line. Not so good is to go to 'run' in the start menu or to double click on the executable file, because the window disappears after the run completes.

The program starts up by displaying the version number, then gives the following prompt:

The following indel frequency models have been defined:
 1  Drosophila intronic DNA
 2  Drosphila noncoding DNA 5' of genes within 500bp
 3  Drosphila noncoding DNA 3' of genes within 500bp
 4  Rodent intronic DNA
 5   Hominid intronic DNA
99  Display license
Which model should the program use?

Respond according to what model is required.

The program then asks for an input file name:

Input file (in Fasta format):

The program then asks which model of nucleotide substitution is to be used:

Model of nucleotide substitution?

1. Jukes-Cantor

2. Kimura-2-Parameter

Choice:

If the Kimura-2-Parameter model is chosen, the program then asks for the value of the transition/transversion ratio to assume.

Transition/transversion ratio:

A value of 2 is typical for many DNA sequences.

The program will then proceed to find the alignment of highest probability, according to the model specified.

The alignment is output to inputfile.out, where inputfile is the name of the input file specified by the user.

There is a system dependent limit for the length of a sequence that can be aligned. It this limit is exceeded, the program will abort with a memory allocation failure.

Running MCALIGN

Start the program up, as described above for MCALIGN2. The first prompt is to ask for an indel frequency model, also as described above for MCALIGN2.

The program then asks:

Number of sequences to align [2 or 3]

If 3 is entered, the program will attempt to read in 3 sequences from the input file and attempt to align them by parsimony. The third sequence in the file is assumed to be the outgroup. In this case, some additional parameters are needed:

Do you know the ratio of branch lengths of ingroups:outgroup? [1=yes,0=no]

If 0 is input then the rate of evolution in the outgroup sequence relative to the ingroups (lambda) is assumed to be unknown, and the program does not use information on lambda in calculating alignment probability. If information on lambda is known by the user, and 1 is input, the program asks for the value of lambda:

Ratio of outgroup length to total ingroup length

Currently, only a value of 1 is acceptable, since the algorithm to calculate probability for values other than 1 has not been implemented.

The program will then ask for the number of iterations to be carried out:

Number of iterations [0=default]

Each iterations tries out a new proposal alignment. It is recommended that the default number of iterations [0] is selected.

The program then asks for an input file name:

Name of input file

Finally, the program asks if a rapid alignment should be done. Normally, the slow accurate alignment option should be selected.

Generate a slow accurate [=0] or rapid inaccurate [=1] alignment?

What happens during a run. Initial alignments and probability maximizations using several internal alignment parameter values are carried out. The final maximization starts from the highest probability alignment found in the initial alignments.

MCALIGN Results output files. The file mcalign.ml.out contains (one of) the most probable (MP) alignment(s). The file mcalign.out contains the MP alignment(s) and all alignments with probability >1% of it. Alignment probability is given relative to that for the MP alignment. The file is unordered. The file mcalign.diag.out contains run diagnostics.

Supplying a user defined model of sequence evolution

Parameters for a user-defined model normally come from sequence data of species related to the species of interest, that are sufficiently closely related as to make alignment by standard heuristic methods unambiguous. If the user supplies an optional file named indelfreq.user.dat, this is recognised by the program as a user defined indel evolution model. For example, consider a model for Drosophila intronic DNA. The file indelfreq.user.dat would have the following format:

0 Drosophila intronic DNA

0.225

40

0.45454500 0.18181800 0.05297188 0.04527511 0.03869668

0.03307408 0.02826845 0.02416106 0.02065048 0.01764999

0.01508546 0.01289355 0.01102013 0.00941892 0.00805036

0.00688065 0.00588089 0.00502640 0.00429607 0.00367186

0.00313834 0.00268234 0.00229260 0.00195949 0.00167477

0.00143143 0.00122345 0.00104568 0.00089374 0.00076388

0.00065289 0.00055803 0.00047695 0.00040765 0.00034842

0.00029779 0.00025452 0.00021754 0.00018593 0.00015892

The file starts with zero, then a string descriptor of the model. On the next line is the value of theta (0.225), the rate of indels relative to nucleotide substitutions. The next number (40) is the length of the vector, w, specifying the relative frequencies of indels of length i. Then comes a list of the elements of w, which should sum to 1.

Dealing with long sequences

Execution time increases nonlinearly with sequence length for both MCALIGN and MCALIGN2, and memory usage is quadratic in sequence length for MCALIGN2. It is therefore recommended to align sequences of only up to a few kb long. For long contigs, some other heuristic method (e.g., CLUSTAL, DIALIGN or AVID - see Keightley and Johnson 2004 for references) can be used to create an initial alignment that can then be aligned in segments of, say, 1kb by MCALIGN/MCALIGN2.

Compiling the programs

This may be necessary if the executables provided do not run on your system. The following refers to a Unix system; compiling the programs on other systems may be somewhat different.

gcc -O3 -o mcalign mcalign.c seqalign.c -lm

The -O3 option specifies that the object code should be optimised for increased speed; compilers other than gcc may use a different option. The -o option specifies the name of the executable (object) file, mcalign in this case.

g++ mcalign2.cpp -O3 -o mcalign2 -lm

Bug reports

If you discover a bug, please email details to This is an image of Peter Keightley's email address