Paul McKeigue
In a DNA pooling design for a genetic association study, DNA pools are formed from cases and controls, with or without replication of each pool, and SNP allele frequencies in these pools are measured. The rationale for this design is to reduce assay costs, in comparison with a design in which the individuals are genotyped.
The efficiency of this design can be measured by the precision (inverse variance) of the pool-based estimates of allele frequencies in case and controls compared with the precision of an estimates based upon genotyping the individuals whose DNA samples were pooled. This depends upon the measurement variance, the size of the pools (number of individuals per pool), and the number of replicates of each pool. In a design with pools of N diploid individuals replicated T times, at a SNP with allele frequency p, the efficiency is
POOLSCORE uses all information available from the study design
to
test for association. The program models the allele counts in
each pool, the assay
precision, the background signal level and the
population allele frequencies as missing data. Modelling the
background signal level accounts for the bias of allele frequency
measurements towards 0.5.
The program uses Markov chain Monte Carlo simulation to generate the
posterior
distribution of
missing data given the observed data (frequency measurements and pool
sizes)
under the null hypothesis of no association. This
posterior
distribution is generated by Markov chain Monte Carlo
simulation. Score
tests for association, based on comparing the allele counts in
cases and
controls, are constructed by averaging over this posterior
distribution. The model assumes that the
measurements have
a gaussian
distribution, given the underlying allele count. It might be
more
realistic to model the measurements with a more
heavy-tailed
distribution such as a t
distribution that would give
less weight to outlier
measurements. A classical score test (based on the gradient
of
the log-likelihood as a function of the log odds ratio associated with
one extra copy of the allele) is evaluated by averaging over this
posterior distribution.
To rank loci by the strength of evidence in favour of association, it is logical to use Bayes factors rather than p-values. The Bayes factor, unlike the p-value, takes into account the power of the study to detect an effect of plausible size at each locus under test. To evaluate the Bayes factor, you have to make some assumptions about what sizes of effect are plausible, and specify these assumptions in the prior. The score test results can be used to calculate the Bayes factor comparing the evidence for association with the null hypothesis, if a Gaussian prior on the effect size (under the hypothesis that there is an association) is specified. By default, POOLSCORE calculates the Bayes factor using a Gaussian prior with mean zero and precision 0.1. Under this prior, the expected absolute size of the log odds ratio is about 0.4: equivalent to an odds ratio of 1.5. POOLSCORE calculates the log Bayes factor using logs to base 10
Clayton
argues that in a genome-wide association study, the prior odds of
association with any single gene are of the order of 3000:1 against,
and therefore that a log10 Bayes factor of at least 4
is required for the posterior odds to be at least 3:1 in
favour of
association. .
A useful by-product of the score test algorithm is that it
yields an estimate
of the efficiency of the study design, in comparison to one in
which individuals are genotyped. The
efficiency is the
ratio of observed information to complete information.
POOLSCORE
can be used with simulated data to evaluate the efficiency of various
study designs. Another advantage
of score tests is that meta-analyses are straightforward: for
each
hypothesis tested, we can just sum the score and observed information
across studies, and recalculate the test statistic.
POOLSCORE is an R script (poolscore.R)
containing a main function named poolscore(outcome, y,
n, outfile, header, burnin, samples)
and other functions that are called by this main function.
Some functions have been written in C for speed: these are in poolscore.c. This has been compiled for linux on x86_64 architecture as the shared library poolscore.so. To install poolscore on other systems, you can compile the library by typing R CMD SHLIB poolscore.c -o poolscore.so at the command prompt.
Brief instructions for using the program are given below. These instructions assume a basic knowledge of how to manipulate data objects and functions in R.
POOLSCORE requires the raw data (such as probe intensities
from a
genotyping chip) to be converted to allele frequency measurements: this
may require some correction for bias in the assay, such as the
k-correctiondescribed
for Affymetrix chips. It is preferable to sum the
intensities for alleles A and allele B before taking the ratio sum(A) /
sum(A + B), rather than to take the mean of the ratio A/(A+B) over
probe pairs. To prepare the data,
use R to create the objects listed below from
your dataset. These will be passed as arguments to the
function poolscore().
outcome -
binary vector of length P, specifying
whether the pool consists of cases or controls
locusnames
- character vector of length L, specifying the locus names
y
- three-way numeric array of
dimension P x T x L (where P is number of pools, R is number of
replicates of
each pool, and L is number of loci) specifying allele frequency
measurements. The number of individuals (the pool size) can
vary between pools. Missing values are
allowed - so if the number of replicates is not the same for each
pool, just use missing values to fill out the array.
The locus names should be assigned to dimnames(y)[[3]]:
in other words, dimension 3 of the array should be labelled with the
locus names.
The script requires the library "combinat".
If not
already
installed, you can install it from R with the command install.packages("combinat").
Load the script by typing source("poolscore.R")
You can load a small test dataset by typing source("testdata.R").
This dataset consists of allele frequency measurements at
100 autosomal loci on the Affy6 array in 31 pools each measured with 2
replicates. The objects case, y3way, and n specify case-control status, allele frequency measurements, and number of individuals in each pool
Run the program by typing
results <- poolscore(case, y3way,
2*n, "results.txt", TRUE, 20, 200)
This will store the results in a data frame named results, and also write them to the file "results.txt". The pool size is specified as the number of gametes: 2*n for diploid individuals. The fifth argument specifies whether to write a header line in the results file. The
last two arguments to the function poolscoreare
the number of samples for burnin and inference respectively: 50
and 250 are adequate for an approximate estimate. If after the
initial run the log10 Bayes factor is greater than 3, the script will automatically continue with a longer run.
The results table has ten columns, with one row for each locus
The columns are:-