back to index

POOLSCORE - a program for analysis of case-control genetic association studies using allele frequency measurements on DNA pools

Paul McKeigue

In a DNA pooling design for a genetic association study,  DNA pools are formed from cases and controls, with or without replication of each pool, and SNP allele frequencies in these pools are measured.   The rationale for this design is to reduce assay costs, in comparison with a design in which the individuals are genotyped.  

The efficiency of this design can be measured by the precision (inverse variance) of the pool-based estimates of allele frequencies in case and controls compared with the precision of an estimates based upon genotyping the individuals whose DNA samples were pooled.  This depends upon the measurement variance, the size of the pools (number of individuals per pool), and the number of replicates of each pool.    In a design with pools of N diploid individuals replicated T times, at a SNP with allele frequency p, the efficiency is  



where sigma^2 is the variance of measurements of allele frequency between replicates of a given pool. This is typically about 0.001, equivalent to a standard deviation of 0.03. Thus in a design with no replication and pools of size N = 50 individuls, the efficiency at a locus with allele frequency 0.2 woud be 62%.  The same efficiency could be achieved with a design using pools of size N=100 and T=2 replicates of each pool, but this of course would require the same number of assays.  

One way to test for association is to use a t-test for differences between mean allele frequencies in case and control pools. This has limitations; it ignores the information about sampling variance and assay precision  that is available from the study design: we know the pool sizes, and thus the binomial sampling variance of the allele counts in each pool. With a t-test, there is only one independent observation per pool, and only the mean over  replicates of the measurements on each pool can be used in the analysis; the information about assay precision that is available from the variance between replicates is ignored.  An alternative approach is to use clustering algorithms, as implemented in the GENEPOOL program developed by the Translational Genomics Research Institute, to assess the extent to which the data points at each locus can be separated into case and control clusters.  The clustering score can be used to rank SNPs before selecting which loci to type on individuals, but clustering algorithms do not directly evaluate the strength of evidence in favour of association.     

POOLSCORE uses all information available from the study design to test for association.  The program models the allele counts in each pool, the assay precision, the background signal level and the population allele frequencies as missing data.   Modelling the background signal level accounts for the bias of allele frequency measurements towards 0.5.  

The program uses Markov chain Monte Carlo simulation to generate the posterior distribution of missing data given the observed data (frequency measurements and pool sizes) under the null hypothesis of no association.   This posterior distribution is generated by Markov chain Monte Carlo simulation.  Score tests for association, based on comparing the allele counts in cases and controls, are constructed by averaging over this posterior distribution.   The model assumes that the measurements have a gaussian distribution, given the underlying allele count.  It might be more realistic to model the measurements with a more heavy-tailed distribution such as a t distribution that would give less weight to outlier measurements.   A classical score test (based on the gradient of the log-likelihood as a function of the log odds ratio associated with one extra copy of the allele) is evaluated by averaging over this posterior distribution.  

To rank loci by the strength of evidence in favour of association, it is logical to use Bayes factors rather than p-values.  The Bayes factor, unlike the p-value, takes into account the power of the study to detect an effect of plausible size at each locus under test.  To evaluate the Bayes factor, you have to make some assumptions about what sizes of effect are plausible, and specify these assumptions in the prior.  The score test results can be used to calculate the Bayes factor comparing the evidence for association with the null hypothesis, if a Gaussian prior on the effect size (under the hypothesis that there is an association) is specified.  By default, POOLSCORE calculates the Bayes factor using a Gaussian prior with mean zero and precision 0.1.  Under this prior, the expected absolute size of the log odds ratio is about 0.4: equivalent to an odds ratio of 1.5.  POOLSCORE calculates the log Bayes factor using logs to base 10

Clayton argues that in a genome-wide association study, the prior odds of association with any single gene are of the order of 3000:1 against, and therefore that a log10 Bayes factor of at least 4 is required for the posterior odds to be at least 3:1 in favour of association.  .  
A useful by-product of the score test algorithm is that it yields an estimate of the efficiency of the study design, in comparison to one in which individuals are genotyped.   The efficiency is the ratio of observed information to complete information.  POOLSCORE can be used with simulated data to evaluate the efficiency of various study designs.  Another advantage of score tests is that meta-analyses are straightforward: for each hypothesis tested, we can just sum the score and observed information across studies, and recalculate the test statistic.    

The main disadvantage of using POOLSCORE, in comparison with clustering algorithms, is that computation for a genome-wide dataset with hundreds of thousands of loci is slow: typically requiring about 1 second per locus.  This is not a serious problem if you have access to a task farm with a few hundred processors, as you can easily break up the dataset into blocks of a few thousand loci. 

Instructions

POOLSCORE is an R script (poolscore.R) containing a main function named poolscore(outcome, y, n, outfile, header, burnin, samples)  and other functions that are called by this main function.   Some functions have been written in C for speed: these are in poolscore.c.   This has been compiled for linux on x86_64 architecture as the shared library poolscore.so. To install poolscore on other systems, you can compile the library by typing R CMD SHLIB poolscore.c -o poolscore.so at the command prompt.  

Brief instructions for using the program are given below.    These instructions assume a basic knowledge of how to manipulate data objects and functions in R. 

Preparing the data objects file

POOLSCORE requires the raw data (such as probe intensities from a genotyping chip) to be converted to allele frequency measurements: this may require some correction for bias in the assay, such as the k-correctiondescribed for Affymetrix chips.  It is preferable to sum the intensities for alleles A and allele B before taking the ratio sum(A) / sum(A + B), rather than to take the mean of the ratio A/(A+B) over probe pairs.  To prepare the data, use R to create the objects listed below from your dataset.  These will be passed as arguments to the function poolscore().    

outcome - binary vector of length P, specifying whether the pool consists of cases or controls

locusnames - character vector of length L, specifying the locus names

y - three-way numeric array of dimension P x T x L (where P is number of pools, R is number of replicates of each pool, and L is number of loci) specifying allele frequency measurements.   The number of individuals (the pool size) can vary between pools. Missing values are allowed - so if the number of replicates is not the same for each pool, just use missing values to fill out the array.  The locus names should be assigned to dimnames(y)[[3]]: in other words, dimension 3 of the array should be labelled with the locus names.    

n - integer vector of length P, specifying number of individuals in each pool

Running the script

The script requires the library "combinat".  If not already installed, you can install it from R with the command install.packages("combinat")

Load the script by typing source("poolscore.R")

You can load a small test dataset by typing source("testdata.R").  This dataset consists of allele frequency measurements at 100 autosomal loci on the Affy6 array in 31 pools each measured with 2 replicates.  The objects case, y3way, and n specify case-control status, allele frequency measurements, and number of individuals in each pool  

Run the program by typing results <- poolscore(case, y3way, 2*n, "results.txt", TRUE, 20, 200)

This will store the results in a data frame named results, and also write them to the file "results.txt".  The pool size is specified as the number of gametes: 2*n for diploid individuals.  The fifth argument specifies whether to write a header line in the results file.  The last two arguments to the function poolscoreare the number of samples for burnin and inference respectively: 50 and 250 are adequate for an approximate estimate.  If after the initial run the log10 Bayes factor is greater than 3, the script will automatically continue with a longer run.  

  

Interpreting the results

The results table has ten columns, with one row for each locus

The columns are:- 

  1. score: gradient of the log-likelihood as a function of the effect parameter (log odds ratio for association of the allele with outcome) at the null. 
  2. complete information: this can be interpreted as the information you would have about the effect parameter if you had complete data (allele counts in each pool).  It is defined as minus the second derivative of the log-likelihood function given the complete data, averaged over the posterior distribution of the complete data.   
  3. observed information: minus the expectation of the second derivative of the log-likelihood at the null given the observed data. 
  4. percent missing information attributable to measurement: percent of complete information that was lost because of measurement error
  5. percent information extracted: ratio of observed information to complete information, which can be interpreted as the efficiency of the study design.  The percent information extracted and the percent missing information attributable to measurement error add almost to 100: the remaining missing information is that attributable to uncertainty about model parameters, which is typically less than 1% of the complete information. 
  6. test statistic: this has a standard normal distribution under the null
  7. log Bayes factor: this is the natural logarithm of the Bayes factor comparing the hypothesis of association with the null hypothesis.  This is calculated from the test statistic, the observed information and the prior on the effect size (specified as Gaussian with mean zero and variance 0.1).   
  8. p-value: calculated from standard normal deviate
  9. allele frequency: posterior mean, cases and controls combined
  10. precision: posterior mean of the precision (inverse variance) of the measurement of allele frequency

Future extensions to the program

Stratified analyses
Test for association with an outcome coded as ordered categories.