MLGENOMEU - estimation of mutation parameters from mutation accumulation experiments

Notes for Version 2.08
For a detailed description of the methods used in the program, see the following papers.

The initial Monte Carlo-based algorithm is described in:

Keightley, P. D. (1994). The distribution of mutation effects on viability in Drosophila melanogaster. Genetics 138: 1315-1322.

The numerical integration algorithm used by this program is a development of the above, described in:

Keightley, P. D. and Ohnishi, O. (1998). EMS-induced polygenic mutation rates for nine quantitative characters in Drosophila melanogaster. Genetics 148: 753-766. [Article]

Simulation study:

Keightley, P. D. (1998). Inference of genome wide mutation rates and distributions of mutation effects for fitness traits: a simulation study. Genetics 150: 1283-1293. [Article]

The following instructions are written with a Unix system in mind; running the program on other systems may differ somewhat from what follows.

Obtaining the program

The program contains copyright routines, used under license, and unfortunately it has not been possible to offer download of the program from this website. The program must instead be requested by contacting the author by email:  . Files associated with the program can be downloaded from this website (see below). One of these files, gengamtab.out, is a required file,

Compilation of the Program

The program is written in C, and consists of only the one module:

To compile the program using, for example, the gcc compiler on a Unix system:

gcc -O2 -o mlgenomeu mlgenomeu.c -lm

The -O2 option specifies that the object code should be optimised for increased speed; compilers other than gcc may use a different option to specify this. The -o option specifies the name of the executable (object ) file, mlgenomeu in this case.

Input File Formats

The program expects an input file, data.dat.  The first entry in the file is the generation number of the mutation accumulation (MA) lines. In this program, only one MA generation (+ control = generation 0)  can be handled.  The input file contains a list of phenotypic observations for the MA line replicates and the control line replicates.  Each observation is the phenotypic value of one replicate for an MA or control line. The control line replicate values are assumed to follow a normal distributed.  No additional random effects are estimated by this version of the program, but fixed effects may be included by inserting an appropriate code into the input data file.  All replicates are assumed to be measured with the same error.  MA lines have at least one replicate, and the number of replicates per MA line can vary.  The individual MA lines are given a unique integer identifier (>0), but replicate numbers within a line are not given codes.  The phenotypic value for the trait is a real number.  Control lines are given the identifier `0'. An example of  part of a simulated data.dat file is as follows. The MA lines (50 of them, each with 4 replicates) come first. This is followed by a block of control line replicaes. Note that the generation number of the MA lines is 1 in this case.

1 9.735818 1 9.079754 1 10.793222 1 10.062783
2 11.842031 2 10.241879 2 10.206810 2 9.931010
3 8.810917 3 8.723238 3 10.741316 3 10.449603
4 7.858996 4 10.354959 4 9.068551 4 9.349407
49 10.538285 49 9.587561 49 10.778127 49 8.931881
50 11.384082 50 9.507858 50 8.633099 50 10.235791

0 8.934888 0 10.686638 0 9.251242 0 11.910901
0 9.242428 0 8.849134 0 9.418149 0 10.984947
0 9.951213 0 8.866559 0 9.243635 0 8.940020
0 12.175585 0 9.282169 0 9.776285 0 8.819433
0 12.093757 0 9.640237 0 11.307031 0 9.133931
0 9.807303 0 10.327360 0 9.346597 0 11.193762
0 10.825930 0 8.457343 0 10.121449 0 9.166584

Download complete example file data.dat.

A downloadable file (zipped) gengamtab.out.gz, containing tables of gamma distributions used by the numerical integration procedure, is required. This file should not be changed in any way. In order to start downloading these files, it may be necessary to hold the shift key down while clicking on the file.

Running the Program and its Input Parameter Values

The program is set up to generate profile likelihoods, in which one of the parameters is kept at a fixed value.  Convergence over the full likelihood surface is often a problem, so maximization with all of the key parameters of the model as variables is discouraged.

IMPORTANT. Runs should be checked with different starting values, in order to check convergence, and to check for the presence of multiple peaks.

There is a `line search' option to search over fixed values of a specific parameter, and this may be used to increase the robustness of the search, but it takes more CPU time. The way in which line search is used is explained below.

The recommended way to run the program is to estimate U and E(a) for a series of different values of beta ranging from ->infinity (equal effects - the -99 beta option) to, say, 0.1.  Within each model (i.e. value for beta), confidence intervals for U and E(a) can be computed from profile likelihoods. The fit of different models can be compared via the likelihood.

The summary output is written to the file sum.out.  Full details of the run are stored in mlgenomeu.out.

Below, input values are shown in red for a typical run in which beta is fixed at 1.0.  The starting value for alpha is 10.0 giving a mean mutant effect of beta/alpha = 0.1 (this obviously has to be chosen to give a reasonable fit to the data).  This run inputs a value for the proportion of mutants positive = zero, i.e.  all mutations are deleterious.  If P is set to >0, the program slows down dramatically.  Here is an example of the start of a run with suitable input values. This was run using the example file data.dat that can be downloaded above. Comments on the input values are given in green.

$ mlgenomeu

Opened file sum.out for write, append mode.
Enter no. fixed effect to be estimated 0
    See below for a description of how to incorporate fixed effects.
Enter 1 if exactly 1 mutation per line assumed (TE model) 0
    If 1 is selected here, the model assumes that each MA line contains exactly one mutation (e.g. a TE insertion)
Parameter beta (-99 for equal effects) :
Enter 1/0 if variable/not variable during likelihood maximization 0
    It is recommended that a zero is entered here. Comparison of different models (beta values) can be done by comparing their likelihoods.
Enter starting value for beta (-99 for equal effects)  1
    Beta is the shape parameter of the gamma distribution; -99 implies beta -> infinity.
Parameter U :
Enter 1/0 if variable/not variable during likelihood maximization 1
    Normally, enter 1 to maximize over the mutation rate, unless you are interested in the likelihood for a specific value of U in computing e.g. profile likelihood
Enter starting value for U  0.5
Enter 1 to input E(a) as starting value, 0 to input alpha 0
    The program uses a variable parameter alpha as the scale parameter of the distribution of mutational effects. For convenience, the user can input a starting value of the mean mutational effect E(a), which relates to alpha by E(a) = beta/alpha.
Parameter alpha:
Enter 1/0 if variable/not variable during likelihood maximization 1
    Normally enter 1 here, unless carrying out a profile likelihood.
Enter starting value for alpha 10.0
Parameter proportion of positive mutational effects:
Enter 1/0 if variable/not variable during likelihood maximization 0
    Normally enter 0
Enter starting value for proportion of positive mutational effects 0
Enter convergence criterion (for default 1.0e-9, enter 0) 0
    Normally enter 0
Enter 1 if perform line search on one parameter 0
    Running a line search is described below
Enter 0 to compute Mean + Ve starting values from control line data,
 1 to input manually (normally answer 0) 0
    Starting values for the population mean and Ve are usually calculated from the control line data, but values can also be input to assist in exploring the likelihood surface
Opened file data.dat for read.
Generation number 1
Control mean 9.976927, variance 1.095078
No. MA lines 50, mean 10.013337
Enter 1 if control/MA have separate environmental variances
 (must enter 0 if MA lines unreplicated) 0
    This option can be used if there is reason to believe that Ve differs between the control and MA lines. To have power to estimate separate variances, the MA lines must be replicated.
Opened file gengamtab.out for read.
mlgenomue version x.yy
No. parameters 4 No. evaluations 10000
  1 b 1.00000 U 0.50000 a 10.00000 M 9.98 V 1.09508 LogL -440.38656
  2 b 1.00000 U 0.60000 a 10.00000 M 9.98 V 1.09508 LogL -440.55183

Fixed effects

These are coded after the line number in data.dat. For a fixed effect with n levels, they are coded from zero to n-1.

Line search option

This option allows one of the parameters to have a series of fixed values, and likelihood is maximized with respect to the other variable parameters in a series of separate runs. At the end of these runs, the line search parameter is changed to a variable parameter, with a starting value equal to that which gave the highest likelihood in the series of fixed value runs. Using the line search option can reduce the dimensionality of the search, so can help in finding the global likelihood maximum if their are multiple peaks in the likelihood surface.

Bug reports

Please email any bug report to