# Introduction

This tutorial demonstrates how to use the R package wevid for quantifying the performance of a diagnostic test or predictor of a binary outcome as results described in the paper Quantifying performance of a diagnostic test as the expected information for discrimination: relation to the C-statistic, now in press in Statistical Methods for Medical Research.

## Limitations of the $$C$$-statistic for quantifying performance of a diagnostic test

• It is not obvious what the $$C$$-statistic (area under the ROC curve) tells us about how the test will perform for risk stratification: for instance if a test is used to screen for disease and a risk of 5% is set as the threshold for further investigation, we would like to know what proportion of cases and non-cases will test positive.

• The increment in $$C$$-statistic obtained by adding extra variables (such as a panel of biomarkers) to the model is difficult to interpret as a measure of increment in predictive performance. For instance the increment in $$C$$-statistic will be greater in a case-control study in which covariates such as age have been matched between cases and controls, than in a cohort study in which these covariates differ between cases and controls and have been included in the model, even if the extra variables are independent of the covariates.

• The small increments in $$C$$-statistic that can be obtained by adding variables to a model that has a $$C$$-statistic of 0.9 or above have led to a mistaken belief that no useful increment in predictive performance can be obtained

“Researchers have observed that $$\Delta$$AUC depends on the performance of the underlying clinical model. For example, good clinical models are harder to improve on, even with markers that have shown strong association”

• The crude ROC curve obtained by comparing the scores of all possible case-control pairs at each value of the score in cases and controls is not necessarily concave downwards. For any ROC curve that is not concave, one can produce a classifier superior to that summarized by such an ROC curve (Hand, 2009).

# Historical note

Dorothy Wrinch and Harold Jeffreys (1921) were the first to write Bayes theorem in the odds form, showing that the ratio between likelihoods of hypotheses, later called the Bayes factor, transforms prior odds into posterior odds. This laid the basis for the Bayesian approach to hypothesis testing.

$\left(\textrm{prior odds } \mathcal{H}_1 \colon \mathcal{H}_2 \right) \times \frac{\textrm{likelihood of } \mathcal{H}_1} {\textrm{likelihood of } \mathcal{H}_2} = \left(\textrm{posterior odds } \mathcal{H}_1 \colon \mathcal{H}_2 \right)$

Taking logarithms, we can write this equation in terms of the weight of evidence (log Bayes factor). Weights of evidence contributed by independent observations can be added, just like physical weights. If we use logarithms to base 2, the weight of evidence can be expressed in bits, which have a more intuitive interpretation than natural log units.

$\log{\textrm{prior odds } \mathcal{H}_1 \colon \mathcal{H}_2 } + \textrm{weight of evidence } \mathcal{H}_1 \colon \mathcal{H}_2 = \log{\textrm{posterior odds } \mathcal{H}_1 \colon \mathcal{H}_2 }$

The first practical use of the log Bayes factor to quantify the weight of evidence favouring one hypothesis over another was by Turing at Bletchley Park.   Hut 8, Bletchley Park Alan Turing Jack Good

The Banburismus procedure was based on accumulating weights of evidence for the settings of the right-hand and middle rotors of the Enigma machine. Good recounted in 1994:-

One morning I asked Turing “Isn’t this really Bayes’ theorem?” and he said “I suppose so.” He hadn’t mentioned Bayes previously.

## Sampling distribution of the weight of evidence

To be able to decide in advance whether Banburismus was likely to be successful, Turing began investigating the sampling distribution of the weight of evidence in 1940 or 1941. A list of the properties of weights of evidence derived by Turing was given in a now declassified paper by Good in the NSA Technical Journal. Good and Toulmin (1968) derived an identity mapping the characteristic function $$\phi_1 \left( t \right)$$ of the distribution of the weight of evidence in favour of a hypothesis when it is true to the characteristic function $$\phi_0 \left( t \right)$$ of the distribution of the weight of evidence in favour of that hypothesis when it is false. Their derivation was as follows:-

Suppose that the predictors can take $$J$$ distinct values with probability of the $$j$$th value given by $$p_j$$ when the hypothesis is true and $$q_j$$ when the hypothesis is false. We have:-

\begin{align} \phi_1 \left( t \right) &= \sum{ p_j \exp{ \left( it \log{ \frac{p_j}{q_j}} \right)}} \\ \phi_1 \left( t + i \right) &= \sum{ p_j \exp{ \left( i \left(t + i \right) \log{ \frac{p_j}{q_j}} \right)}} \\ &= \sum{ p_j \exp{ \left( i t \log{ \frac{p_j}{q_j}} \right) } \exp{ \left( -\log{\frac{p_j}{q_j}} \right) }} \\ &= \sum{ q_j \exp{ \left( i t \log{ \frac{p_j}{q_j}} \right) }} \\ &= \phi_0 \left( t \right) \end{align}

This identity can be stated in an alternative form as $$e^W p_1 \left( W \right) = p_0 \left( W \right)$$, where $$p_1 \left( W \right)$$ and $$p_0 \left( W \right)$$ are the densities of the weight of evidence $$W$$ favouring $$\mathcal{H}_1$$ over $$\mathcal{H}_0$$ when $$\mathcal{H}_1$$ is true and when $$\mathcal{H}_0$$ is true respectively. This result can be obtained simply by noting that at any value of $$W$$ the ratio $$p_1 \left( W \right) / p_0 \left( W \right)$$ is the Bayes factor $$e^W$$ favouring $$\mathcal{H}_1$$ over $$\mathcal{H}_0$$.

This identity generalizes two results originally obtained by Turing:-

• If the density $$p_1 \left( W \right)$$ is gaussian with mean $$\mu$$, the density $$p_0 \left( W \right)$$ is gaussian with mean $$-\mu$$ and both densities have variance $$2 \mu$$.
• This can be shown by substituting $$\left( t - i \right)$$ for $$t$$ in the expression $$\exp{\left( i \mu t + \frac{1}{2} \sigma^2 t^2 \right)}$$ for the characteristic function of a gaussian distribution.
• The expected Bayes factor in favour of a hypothesis when it is false is 1, as $$e^{-W} p_0 \left( W \right)$$ must integrate to 1.
• With a good test, the sampling distribution of the Bayes factor in favour of the wrong hypothesis will be right-skewed so that its expectation is 1 even though most of its probability mass is close to zero. A practical implication is that a good diagnostic test will not often be wrong but when it is wrong it may be wildly wrong, giving strong evidence in favour of the wrong diagnosis.

If there are many independent predictors of small effect, the weight of evidence will have its asymptotic gaussian distribution and there is a simple relationship of the $$C$$-statistic to the expected weight of evidence $$\Lambda$$:

$C = 1 - \Phi \left(-\sqrt{\Lambda} \right)$ where $$\Phi \left( \cdot \right)$$ is the standard Gaussian cumulative distribution function. Asymptotic relationship of $$C$$-statistic to expected weight of evidence $$\Lambda$$

The $$C$$-statistic can be viewed as a mapping of the expected weight of evidence $$\Lambda$$, which takes values from 0 to infinity, to the interval from 0.5 to 1. If a biomarker with expected weight of evidence 1 bit is evaluated in a case-control study in which covariates have been matched, the increment in $$C$$-statistic will be from 0.5 to 0.8. If the same biomarker is evaluated in a cohort study in which covariates such as age contribute a weight of evidence of 2 bits, the increment in $$C$$-statistic will be much smaller: from 0.88 to 0.925.

The statistic $$\Lambda$$ has various alternative names: the expected information for discrimination between cases and controls; the Kullback-Leibler (KL) divergence from the class-conditional distribution $$\mathcal{Q}$$ of the predictors under incorrect case-control assignment to their distribution $$\mathcal{P}$$ under correct assignment; or the relative entropy of $$\mathcal{P}$$ with respect to $$\mathcal{Q}$$. As $$\Lambda$$ is a KL divergence, it can take only non-negative values.

## Advantages of using expected weight of evidence $$\Lambda$$ to quantify performance of a diagnostic test

• Contributions of independent variables to predictive performance are additive on the scale of $$\Lambda$$.

• The expected weight of evidence has an intuitive interpretation as the typical factor by which prior odds are updated to posterior odds

• Where there are many independent predictors of small effect, the expectation of the weight of evidence determines its distribution, and this distribution contains all the information required to characterize how the test will behave as a risk stratifier.

• The calculation of weight of evidence can be extended to interval-censored failure-time data, to which the $$C$$-statistic is not applicable because it does not condition on the interval lengths.

## Relation of ROC curve to distributions of weight of evidence

Johnson (2004) noted a simple relationship between the distributions of weight of evidence $$W$$ favouring case over control status in cases and controls and the ROC curve generated from these distributions. If the quantiles of $$W$$ in controls and cases are $$q_0$$ and $$q_1$$ respectively, the sensitivity is $$\left(1 - q_1 \right)$$ and the specificity is $$q_0$$ and the ROC is the curve obtained by plotting $$\left(1 - q_1 \right)$$ as a function of $$\left(1 - q_0 \right)$$. The gradient of this function is

$\frac{dq_1}{dq_0} = \frac{d q_1 / d W}{d q_0 / dW} = \frac{p_1 \left( W \right)}{p_0 \left( W \right)}= \exp{ \left( W \right) }$

As $$q_0\left( W \right)$$ increases with $$W$$, it follows that the gradient of this model-based ROC curve is a monotonic decreasing function of $$\left(1 - q_0 \right)$$, unlike the crude ROC curve calculated from ranking the scores of cases and controls. In other words, the model-based ROC curve is concave downwards.

The problem is to estimate the distributions of the weight of evidence $$W$$ when the hypothesis is true and when it is false, subject to the constraint that these distributions must be mathematically consistent.

# The R package wevid

This package provides functions for computing and plotting the distributions of the weight of evidence. A version will eventually be submitted to CRAN. For now the package can be installed with the following commands :-

git clone https://github.com/pmckeigue/wevid
R CMD build wevid
# replace version number of the tar.gz file with whatever is created by the build command
R CMD INSTALL wevid_0.2.0.999.tar.gz

wevid can use the output of any binary classifier as long as the classifier generates three vectors on test data or test folds formed by cross-validation: the observed values of an outcome y, the predictive probabilities posterior.pthat $$y = 1$$ (from a model learned on training data), and the prior probabilities prior.p that $$y = 1$$ (usually just the frequency in the training data).

This package includes three example datasets: pima, cleveland, fitonly. Each dataset consists of a data frame with three columns: prior probability prior.p, posterior probability posterior.p, and binary outcome y. Each of these predictions was generated from the source dataset by fitting a logistic regression model with hierarchical shrinkage prior on the predictors, with predictions generated on test folds by 40-fold cross-validation. The three source datasets from which the predictions were generated are:-

# Examples

## Coronary disease in Cleveland

library(wevid)
library(pander)
library(ggplot2)
library(gtable)
library(grid)
library(gridExtra)

theme_set(theme_grey(base_size = 18))

data(cleveland)
cleveland.densities <- with(cleveland,
Wdensities(y, posterior.p, prior.p))
   1    2
0.00 1.95
Density with 2 mixture components chosen by BIC
pander(summary(cleveland.densities), table.style="multiline",
split.cells=c(5, 5, 5, 5, 5, 5, 5), split.table="Inf",
caption="Prediction of coronary disease in Cleveland")
Prediction of coronary disease in Cleveland
Cases / controls Crude C-statistic Model-based C-statistic Crude Λ (bits) Model-based Λ (bits) Test log-likelihood (nats) log-likelihood after recalibration (nats)
137 / 160 0.895 0.899 2.49 2.26 -119.2 -118.7

The crude estimate of the expected information for discrimination $$\Lambda$$ for the classifier learned on the Cleveland dataset is 2.49 bits. The model-based estimate differs only slightly.

The distributions in cases and controls of the weight of evidence favouring case over control status are plotted below.

 plotWdists(cleveland.densities) + theme_grey(base_size = 18) + theme(legend.title=element_blank())