Professor of Genetic Epidemiology and Statistical Genetics, University of Edinburgh

Also Honorary Consultant in Public Health, NHS Lothian

This is my personal web page. Larger files, including software packages and public datasets, can be found on my research group’s home page on my server.

For the Usher Institute, information about research is on this page, information about taught postgraduate courses is on this page, and information about PhDs is on this page.

In the Usher Institute I am a member of the Molecular Epidemiology Group. I am a member of the Executive Group of the Centre for Statistics based in the School of Mathematics. Since April 2019 I have been on the Medical Research Council’s Methodology Research Programme Panel.

Usher Institute of Population Health Sciences and Informatics

University of Edinburgh Medical School,

Teviot Place, Edinburgh EH8 9AG

Phone +44 131 650 4556

If you need to send me confidential material by email, you can use my PGP public key (obtained by searching for my email address on a PGP key server such as https://pgp.mit.edu), for which the fingerprint is:

683E 7E3B E8B3 83BB 8F80 363A A034 3F3B B2D6 769A

For best practice, you should confirm the key fingerprint with me in person or by video link before using it.
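When you compare the fingerprint reported by your own key software with the one above, differences in spacing or letter case are not significant. A minimal Python sketch of such a comparison follows. The function names are illustrative only, and a matching string is no substitute for confirming the fingerprint with me directly.

```python
# Fingerprint as published on this page.
PUBLISHED_FINGERPRINT = "683E 7E3B E8B3 83BB 8F80 363A A034 3F3B B2D6 769A"


def normalise_fingerprint(fingerprint: str) -> str:
    """Remove all whitespace and upper-case the hexadecimal digits."""
    return "".join(fingerprint.split()).upper()


def fingerprints_match(reported: str, published: str = PUBLISHED_FINGERPRINT) -> bool:
    """Return True only if both strings are the same 40-digit fingerprint."""
    a = normalise_fingerprint(reported)
    b = normalise_fingerprint(published)
    # The fingerprint published above has 40 hexadecimal digits (160 bits),
    # so anything shorter (such as a short key ID) is rejected.
    return len(a) == 40 and a == b
```

Paste the fingerprint shown by your key software as the `reported` argument; a result of `False` means the key should not be used.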

Alternatively, you can send messages to my ProtonMail address shown above, but this is secure only if you are sending from another ProtonMail account.

If you need to transfer large data files securely, I can set up an SFTP account for you on my server. You will need to use SSH public key authentication. Instructions for setting this up on a Windows PC are here.

My research focuses on methods for molecular and genetic epidemiology, with applications in clinical prediction and personalized medicine. This work makes use of Bayesian and computationally intensive statistical methods, and of machine learning methods for constructing predictors. I work closely with Helen Colhoun’s research group at the Centre for Genomic and Experimental Medicine. This collaboration includes the development of an analysis platform based on deidentified electronic health records and the use of this platform to study drug safety and complications of diabetes.

My group’s current research includes

Construction of predictive models for drug response in rheumatoid arthritis in two collaborative studies: the MRC-funded MATURA consortium and the Scottish Early Rheumatoid Arthritis cohort

Development of biomarker-based predictions of diabetic complications in the SUMMIT European consortium and the Scottish Diabetes Research Network Type 1 Bioresource

Development of a platform (GENOSCORES) for constructing genotypic predictors from summary results of genome-wide association studies

The relation of C-peptide persistence in Type 1 diabetes to the genetic architecture of Type 1 and Type 2 diabetes

Development of deep learning algorithms for prediction of diabetic complications from retinal images

Development of statistical methods for learning to classify with high-dimensional biomarker panels and quantifying the incremental contribution of each biomarker or panel to predictive performance.

Risk stratification for colorectal cancer (with Evropi Theodoratou). I am also planning a new project to study the relationship of colorectal cancer risk to the colonic microbiome profile, based on establishing a consented bioresource of faecal samples from the Scottish bowel screening programme.

A list of my research grants is here. In addition, Helen Colhoun and I have recently been awarded 1.4 million euros from the EU for the University of Edinburgh as a partner in the Hypo-RESOLVE project studying hypoglycaemia and its impact in diabetes.

Quantifying performance of a diagnostic test as the expected information for discrimination: relation to the C-statistic (*Statistical Methods for Medical Research* 2018). This paper proposes that the expected information for discrimination (expected weight of evidence) should supplant the C-statistic (area under the ROC curve) for quantifying the performance of a diagnostic test or risk predictor, and for evaluating the incremental contribution of a new biomarker. This page demonstrates the statistical methods on three publicly available datasets, and links to R scripts for running these analyses.
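As a rough illustration of the quantity involved, the Python sketch below computes the expected weight of evidence, in bits, for a simple binary test with given sensitivity and specificity. It is not taken from the R scripts linked above. The function name is hypothetical, and averaging the expected weight of evidence over cases and controls is an assumption of the sketch.

```python
import math


def expected_information_bits(sensitivity: float, specificity: float) -> float:
    """Expected weight of evidence (bits) for a binary test.

    Assumption of this sketch: the value returned is the average of the
    expected weight of evidence favouring the true class in cases and
    in controls.
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")

    # Probabilities of (positive, negative) test results given the true class.
    p_case = (sensitivity, 1.0 - sensitivity)
    p_control = (1.0 - specificity, specificity)

    def expected_woe(p_true, p_other):
        # Expected log2 likelihood ratio favouring the true class,
        # taken over the test results that the true class generates.
        return sum(p * math.log2(p / q) for p, q in zip(p_true, p_other))

    return 0.5 * (expected_woe(p_case, p_control) + expected_woe(p_control, p_case))
```

A test that is no better than a coin toss (sensitivity and specificity both 0.5) yields zero bits, and the value increases as the distributions of test results in cases and controls move apart.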

This paper has had wide media coverage because it draws on unpublished results obtained by Alan Turing while working on the Banburismus procedure at Bletchley Park, later described and extended by his assistant Jack Good. This poster describes the ideas.

Other publications for which the final accepted manuscript is available under open access can be found by searching Edinburgh Research Explorer.

These pages include slide presentations, together with tutorials and notes prepared when advising colleagues, students and other researchers

Quantifying predictive performance using the distributions of weight of evidence

Methodology and misunderstandings in precision medicine

Using GWAS summary statistics to construct polygenic scores for hypothesis testing and prediction

Statistical methods for learning to classify with biomarker panels: insights from cryptanalysis

Weighing evidence in an information war – Ethics Forum, Transatlantic Seminar series, School of Social and Political Sciences

Hensor EMA, McKeigue P, Ling SF, Colombo M, Barrett JH, Nam JL, Freeston J, Buch MH, Spiliopoulou A, Agakov F, Kelly S, Lewis MJ, Verstappen SMM, MacGregor AJ, Viatte S, Barton A, Pitzalis C, Emery P, MATURA Consortium, IACON Consortium, PEAC Consortium, Conaghan PG, Morgan AW. Validity of a 2-component imaging-derived disease activity score for improved assessment of synovitis in early rheumatoid arthritis. Rheumatology 2019, in press.

McGurnaghan SJ, Brierley L, Caparrotta TM, McKeigue PM, Blackbourn LAK, Wild SH, Leese GP, McCrimmon RJ, McKnight JA, Pearson ER, Petrie JR, Sattar N, Colhoun HM, on behalf of Scottish Diabetes Research Network Epidemiology Group. The effect of dapagliflozin on glycaemic control and other cardiovascular disease risk factors in type 2 diabetes mellitus patients: a real-world observational study. Diabetologia 2019, in press.

Colombo M, Looker HC, Farran B, Hess S, Groop L, Palmer CNA, Brosnan MJ, Dalton RN, Wong M, Turner C, Ahlqvist E, Dunger D, Agakov F, Durrington P, Livingstone S, Betteridge J, McKeigue PM, Colhoun HM; SUMMIT Investigators. Serum kidney injury molecule 1 and β(2)-microglobulin perform as well as larger biomarker panels for prediction of rapid decline in renal function in type 2 diabetes. Diabetologia. 2018 Oct 5. doi: 10.1007/s00125-018-4741-9. [Epub ahead of print] PubMed PMID: 30288572.

Cherlin S, Plant D, Taylor JC, Colombo M, Spiliopoulou A, Tzanis E, Morgan AW, Barnes MR, McKeigue P, Barrett JH, Pitzalis C, Barton A, MATURA Consortium, Cordell HJ. Prediction of treatment response in rheumatoid arthritis patients using genome-wide SNP data. Genetic Epidemiology 2018 Dec;42(8):754-771. doi: 10.1002/gepi.22159.

Massey J, Plant D, Hyrich K, Morgan AW, Wilson AG, Spiliopoulou A, Colombo M, McKeigue P, Isaacs J, Cordell H, Pitzalis C, Barton A; BRAGGSS, MATURA Consortium. Genome-wide association study of response to tumour necrosis factor inhibitor therapy in rheumatoid arthritis. Pharmacogenomics J. 2018 Aug 31. doi: 10.1038/s41397-018-0040-6. [Epub ahead of print] PubMed PMID: 30166627.

McKeigue P. Quantifying performance of a diagnostic test as the expected information for discrimination: Relation to the C-statistic. Stat Methods Med Res. 2018 Jan 1:962280218776989. doi: 10.1177/0962280218776989. [Epub ahead of print] PubMed PMID: 29978758.

Colombo M, Looker HC, Farran B, Agakov F, Brosnan MJ, Welsh P, Sattar N, Livingstone SJ, Durrington PN, Betteridge DJ, McKeigue PM, Colhoun HM. Apolipoprotein CIII and N-Terminal Prohormone B-type Natriuretic Peptide as Independent Predictors for Cardiovascular Disease in Type 2 Diabetes. Atherosclerosis 2018 May 21;274:182-190.

van Zuydam NR, Ahlqvist E, Sandholm N, Deshmukh H, Rayner NW, Abdalla M, Ladenvall C, Ziemek D, Fauman E, Robertson NR, McKeigue PM, Valo E, Forsblom C, Harjutsalo V; FINNDIANE Study centres, Perna A, Rurali E, Marcovecchio ML, Igo RP Jr, Salem RM, Perico N, Lajer M, Käräjämäki A, Imamura M, Kubo M, Takahashi A, Sim X, Liu J, van Dam RM, Jiang G, Tam CHT, Luk AOY, Lee HM, Lim CKP, Szeto CC, So WY, Chan JCN; Hong Kong Diabetes Registry TRS Project Group, Ang SF, Dorajoo R, Wang L, Hua Clara TS, McKnight AJ, Duffy S; Warren 3/UK GoKinD Study Group, Pezzolesi MG, Consortium G, Marre M, Gyorgy B, Hadjadj S, Hiraki LT; DCCT/EDIC group, Ahluwalia TS, Almgren P, Schulz CA, Orho-Melander M, Linneberg A, Christensen C, Witte DR, Grarup N, Brandslund I, Melander O, Paterson AD, Tregouet D, Maxwell AP, Lim SC, Ma RCW, Tai ES, Maeda S, Lyssenko V, Tuomi T, Krolewski AS, Rich SS, Hirschhorn JN, Florez JC, Dunger D, Pedersen O, Hansen T, P, Remuzzi G; SUMMIT Consortium, Brosnan MJ, Palmer CNA, Groop PH, Colhoun HM, Groop LC, McCarthy MI. A Genome-Wide Association Study of Diabetic Kidney Disease in Subjects With Type 2 Diabetes. Diabetes. 2018 Apr 27. pii: db170914. doi: 10.2337/db17-0914. PubMed PMID: 29703844.

Morgan A, Taylor J, Bongartz T, Massey J, Mifsud B, Spiliopoulou A, Scott I, Wang J, Morgan M, Plant D, Colombo M, Orchard P, Twigg S, McInnes I, Porter D, Freeston J, Nam J, Cordell H, Isaacs J, Strathdee J, Arnett D, de Hair M, Tak P, Aslibekyan S, Padyukov L, Bridges SL Jr, Pitzalis C, Cope A, Verstappen S, Emery P, Barnes M, Agakov F, McKeigue P, Mushiroda T, Kubo M, Weinshilboum R, Barton A, Barrett J. Genome-wide Association Study of Response to Methotrexate in Early Rheumatoid Arthritis Patients. Pharmacogenomics Journal 2018, May 25. doi: 10.1038/s41397-018-0025-5.

Meng X, Spiliopoulou A, Timofeeva M, Wei W-Q, Gifford A, Shen X, He Y, Varley T, McKeigue P, Tzoulaki I, Wright A F, Joshi P, Denny J C, Campbell H, Theodoratou E. MR-PheWAS: exploring the causal effect of SUA level on multiple disease outcomes by using genetic instruments in UK Biobank. Ann Rheum Dis 2018; in press. doi:10.1136.

McKeigue P. Sample size requirements for learning to classify with high-dimensional biomarker panels. Stat Methods Med Res. 2018 Jan 1:962280217738807. doi: 10.1177/0962280217738807. [Epub ahead of print] PubMed PMID: 29179643.

Pirastu N, Joshi PK, de Vries PS, Cornelis MC, McKeigue PM, Keum N, Franceschini N, Colombo M, Giovannucci EL, Spiliopoulou A, Franke L, North KE, Kraft P, Morrison AC, Esko T, Wilson JF. GWAS for male-pattern baldness identifies 71 susceptibility loci explaining 38% of the risk. Nat Commun. 2017 Nov 17;8(1):1584. doi: 10.1038/s41467-017-01490-8. PubMed PMID: 29146897.

Bermingham ML, Colombo M, McGurnaghan SJ, Blackbourn LAK, Vučković F, Pučić Baković M, Trbojević-Akmačić I, Lauc G, Agakov F, Agakova AS, Hayward C, Klarić L, Palmer CNA, Petrie JR, Chalmers J, Collier A, Green F, Lindsay RS, Macrury S, McKnight JA, Patrick AW, Thekkepat S, Gornik O, McKeigue PM, Colhoun HM; SDRN Type 1 Bioresource Investigators. N-Glycan Profile and Kidney Disease in Type 1 Diabetes. Diabetes Care. 2017 Nov 16. pii: dc171042. doi: 10.2337/dc17-1042. [Epub ahead of print] PubMed PMID: 29146600.

Farran B, McGurnaghan S, Looker HC, Livingstone S, Lahnsteiner E, Colhoun HM, McKeigue PM. Modelling cumulative exposure for inference about drug effects in observational studies. Pharmacoepidemiol Drug Saf. 2017 Oct 12. doi:10.1002/pds.4327. [Epub ahead of print] PubMed PMID: 29024286.

Bell S, Farran B, McGurnaghan S, McCrimmon RJ, Leese GP, Petrie JR, McKeigue P, Sattar N, Wild S, McKnight J, Lindsay R, Colhoun HM, Looker H. Risk of acute kidney injury and survival in patients treated with Metformin: an observational cohort study. BMC Nephrol. 2017 May 19;18(1):163. doi: 10.1186/s12882-017-0579-5. PubMed PMID: 28526011.

Spiliopoulou A, Colombo M, Orchard P, Agakov F, McKeigue P. GeneImp: Fast Imputation to Large Reference Panels Using Genotype Likelihoods from Ultralow Coverage Sequencing. Genetics. 2017 May;206(1):91-104. doi: 10.1534/genetics.117.200063. Epub 2017 Mar 27. PubMed PMID: 28348060.

Quell JD, Römisch-Margl W, Colombo M, Krumsiek J, Evans AM, Mohney R, Salomaa V, de Faire U, Groop LC, Agakov F, Looker HC, McKeigue P, Colhoun HM, Kastenmüller G. Automated pathway and reaction prediction facilitates in silico identification of unknown metabolites in human cohort studies. J Chromatogr B Analyt Technol Biomed Life Sci. 2017 Apr 4. pii: S1570-0232(17)30568-8. doi: 10.1016/j.jchromb.2017.04.002. [Epub ahead of print] PubMed PMID: 28479069.

Sandholm N, Van Zuydam N, Ahlqvist E, Juliusdottir T, Deshmukh HA, Rayner NW, Di Camillo B, Forsblom C, Fadista J, Ziemek D, Salem RM, Hiraki LT, Pezzolesi M, Trégouët D, Dahlström E, Valo E, Oskolkov N, Ladenvall C, Marcovecchio ML, Cooper J, Sambo F, Malovini A, Manfrini M, McKnight AJ, Lajer M, Harjutsalo V, Gordin D, Parkkonen M; FinnDiane Study Group, Jaakko Tuomilehto., Lyssenko V, McKeigue PM, Rich SS, Brosnan MJ, Fauman E, Bellazzi R, Rossing P, Hadjadj S, Krolewski A, Paterson AD; DCCT/EDIC Study Group, Jose C. Florez., Hirschhorn JN, Maxwell AP; GENIE Consortium, David Dunger., Cobelli C, Colhoun HM, Groop L, McCarthy MI, Groop PH; SUMMIT Consortium. The Genetic Landscape of Renal Complications in Type 1 Diabetes. J Am Soc Nephrol. 2017 Feb;28(2):557-574. doi: 10.1681/ASN.2016020231. Epub 2016 Sep 19. PubMed PMID: 27647854.

Postmus I, Warren HR, Trompet S, Arsenault BJ, Avery CL, Bis JC, Chasman DI, de Keyser CE, Deshmukh HA, Evans DS, Feng Q, Li X, Smit RA, Smith AV, Sun F, Taylor KD, Arnold AM, Barnes MR, Barratt BJ, Betteridge J, Boekholdt SM, Boerwinkle E, Buckley BM, Chen YI, de Craen AJ, Cummings SR, Denny JC, Dubé MP, Durrington PN, Eiriksdottir G, Ford I, Guo X, Harris TB, Heckbert SR, Hofman A, Hovingh GK, Kastelein JJ, Launer LJ, Liu CT, Liu Y, Lumley T, McKeigue PM, Munroe PB, Neil A, Nickerson DA, Nyberg F, O'Brien E, O'Donnell CJ, Post W, Poulter N, Vasan RS, Rice K, Rich SS, Rivadeneira F, Sattar N, Sever P, Shaw-Hawkins S, Shields DC, Slagboom PE, Smith NL, Smith JD, Sotoodehnia N, Stanton A, Stott DJ, Stricker BH, Stürmer T, Uitterlinden AG, Wei WQ, Westendorp RG, Whitsel EA, Wiggins KL, Wilke RA, Ballantyne CM, Colhoun HM, Cupples LA, Franco OH, Gudnason V, Hitman G, Palmer CN, Psaty BM, Ridker PM, Stafford JM, Stein CM, Tardif JC, Caulfield MJ, Jukema JW, Rotter JI, Krauss RM. Meta-analysis of genome-wide association studies of HDL cholesterol response to statins. J Med Genet. 2016 Dec;53(12):835-845. doi: 10.1136/jmedgenet-2016-103966. Epub 2016 Sep 1. PubMed PMID: 27587472.

Scotland G, McKeigue P, Philip S, Leese GP, Olson JA, Looker HC, Colhoun HM, Javanbakht M. Modelling the cost-effectiveness of adopting risk-stratified approaches to extended screening intervals in the national diabetic retinopathy screening programme in Scotland. Diabet Med. 2016 Jul;33(7):886-95. doi: 10.1111/dme.13129. Epub 2016 May 11. PubMed PMID: 27040994.

To calculate the required sample size for learning to predict from high-dimensional biomarker panels, classical statistical power calculations are not very useful. What researchers really need is a learning curve, showing how the expected predictive performance of the trained model depends on the size of the training sample from which this model is learned.

I have described a simple method for calculating the sample size required to learn to classify with a high-dimensional biomarker panel, based on the asymptotic distribution of the weight of evidence (log Bayes factor), in a recent paper: Sample size requirements for learning to classify with high-dimensional biomarker panels (*Statistical Methods for Medical Research* 2017, final version now on PubMed). The assumptions underlying this method are:

There are no redundant biomarkers, in the sense that none of them can be calculated as a weighted sum of the others (the covariance matrix is of full rank).

The effects of the biomarkers can be approximated by a linear discriminant model in which the class-conditional distributions of the biomarkers are Gaussian with the same covariance in cases and noncases.

The class-conditional correlations between the biomarkers are the same as the correlations between their prior effect sizes.

The method is implemented in an online sample size calculator, written with the R shiny package and deployed at https://pmckeigue.shinyapps.io/sampsizeapp/.

To use it, move the sliders to specify:

The performance of the optimal classifier that could be learned from a training sample of infinite size. This is specified as the C-statistic (area under the ROC curve). C-statistics of 0.80, 0.88 and 0.925 are equivalent to expected information for discrimination of 1, 2 and 3 bits respectively.

The proportion of biomarkers that have nonzero effect sizes, based on a spike-and-slab mixture model for the distribution of effect sizes. For instance, specifying 0.1 is equivalent to specifying that 90% of biomarkers have zero effect size (the spike component), and the remaining 10% have a Gaussian distribution of effect sizes with mean zero (the slab component). The app will plot the learning curve based on the proportion that you specify, and also based on a model in which all the biomarkers have a Gaussian distribution of effect sizes. If the biomarkers in your panel are unselected (for instance a genome-wide profile of gene transcription levels), you may expect that only a small proportion of these biomarkers contain predictive information. If your biomarker panel has been preselected for relevance (for instance markers previously reported to be associated with the outcome under study), you may expect the proportion with nonzero effects to be relatively high.
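To make the spike-and-slab description concrete, the following Python sketch draws effect sizes from such a mixture. It is an illustration only and is not the code used by the app. The function name, the slab standard deviation of 1 and the fixed random seed are all assumptions.

```python
import random


def sample_effect_sizes(n_biomarkers, proportion_nonzero, slab_sd=1.0, seed=0):
    """Draw effect sizes from a spike-and-slab mixture.

    With probability `proportion_nonzero` an effect is drawn from the slab,
    a Gaussian with mean zero and standard deviation `slab_sd` (an arbitrary
    illustrative value); otherwise it is exactly zero (the spike).
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    effects = []
    for _ in range(n_biomarkers):
        if rng.random() < proportion_nonzero:
            effects.append(rng.gauss(0.0, slab_sd))  # slab component
        else:
            effects.append(0.0)  # spike component
    return effects


# A slider setting of 0.1: about 10% of biomarkers carry a nonzero effect.
effects = sample_effect_sizes(n_biomarkers=10_000, proportion_nonzero=0.1)
nonzero_fraction = sum(e != 0.0 for e in effects) / len(effects)
```

Setting `proportion_nonzero=1.0` corresponds to the comparison model in which every biomarker has a Gaussian effect size.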

Then click the “Submit” button.

The table generated by the app has three columns:

the expectation of the information for discrimination extracted by the trained model, as a percentage of the information for discrimination extracted by the optimal model that would be learned from a sample of infinite size. This is scaled from 25% to 90%.

the C-statistic corresponding to these values of expected information for discrimination.

the ratio of cases to variables, assuming a balanced study design with equal numbers of cases and controls.

The app also plots a learning curve, showing how the expected information for discrimination and the C-statistic obtained with the trained model depend on the ratio of cases to variables. In this example, where the optimal model has an expected information for discrimination of 1 bit (equivalent to a C-statistic of 0.80) and 1% of biomarkers have nonzero effects, a sample size of at least 0.1 cases per variable is required to learn a model whose expected information for discrimination is 60% of that obtained with the optimal model.
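The correspondence between the first two columns of the table can be sketched in Python. The sketch assumes that the weight of evidence is normally distributed in cases and in controls, with means +Λ and −Λ and variance 2Λ in natural log units, so that the C-statistic is Φ(√Λ). This relation reproduces the C-statistics of 0.80, 0.88 and 0.925 quoted above for 1, 2 and 3 bits. The function name and the grid of percentages are illustrative. The third column (cases per variable) depends on the learning-curve calculation in the paper and is not reproduced here.

```python
import math
from statistics import NormalDist


def c_statistic_from_bits(info_bits: float) -> float:
    """C-statistic implied by an expected information for discrimination in bits.

    Assumes Gaussian weights of evidence with means +/-Lambda and variance
    2*Lambda (natural log units), which gives C = Phi(sqrt(Lambda)).
    """
    info_nats = info_bits * math.log(2)  # convert bits to natural log units
    return NormalDist().cdf(math.sqrt(info_nats))


# Illustrative rows for an optimal model with 1 bit of expected information.
optimal_bits = 1.0
for percent in (25, 50, 60, 75, 90):
    bits = optimal_bits * percent / 100
    print(f"{percent:3d}%  {bits:.2f} bits  C = {c_statistic_from_bits(bits):.3f}")
```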