I’ve just contributed to a new review ‘Recent developments in using mechanistic cardiac modelling for drug safety evaluation’, which discusses where we are with using mathematical models of the electrical activity of cardiac muscle cells to predict whether new drugs will cause any dangerous changes to heart rhythm. Quite a bit of the review brings up challenges that we’re going to come across as we assess whether or not the simulations are ready to help regulators decide on the safety of new pharmaceutical compounds.

## How do we quantify ‘predictive power’?

As part of this review I got thinking about how reliable our assessments of simulation predictions are. We often think about evaluating predictions in terms of binary (yes/no) outcomes: e.g. do we keep developing a drug or not; do we allow a drug onto the market or not? To look at predictions of this kind of thing we often use a *confusion matrix*, also known as a *contingency matrix/table*, as shown in Figure 1.

In Kylie‘s study evaluating simulation predictions of whether or not compounds caused 10% or more prolongation of the rabbit wedge QT interval, we looked at a decent number of compounds, 77 for simulations based on PatchXpress automated patch clamp technology (and even more for other sources of screening data). Here’s a slightly more elaborate confusion matrix from that study:

In Table 1 I separated out the predictions into three categories (>= 10% shortening, >= 10% prolongation, or ‘no effect’ as between the two), to allow us to evaluate predictions of shortening as well. But you just need to add up the relevant boxes as shown to evaluate predictions of prolongation. So here we do quite well, and it *feels* as if we’ve got enough compounds in the validation set to say something concrete.

But when we’re evaluating a new simulation’s performance, how many validation compounds do we need? Clearly more than 1 or 2, since you could easily get these right by accident, or design. But are 10 validation compounds enough? Or do we need 20?, 30?, 50?, 77 as we used here?, 100?, or even more?

## How does our evaluation depend on the number of validation compounds?

In Figure 2 I’ve plotted how our estimates of the prediction sensitivity and accuracy change (and converge to a roughly steady value) as we increase the number of validation compounds that we consider. Here the estimates happen to ‘level out’ at about N=60 compounds. But by simply permuting the order in which we consider these 77 compounds, and repeating the process, we get another possible graph (two permutations shown in Figure 2).

So if we didn’t have the big picture here, and only had 30 compounds to work with (solid vertical lines in Fig. 2), we could equally well conclude that the sensitivity of the simulation prediction is 35% (which you’d consider rubbish) or 65% (pretty good)! So we have a problem with N=30 validation compounds.

## Can we estimate our uncertainty in the performance of our model?

How do we know how far away the ‘real’ sensitivity and accuracy is likely to be? And by ‘real’ I mean the sensitivity and accuracy if we had an infinitely large set of validation compounds to test.

Luckily, the stats for this is well established (and Dave remembered it when we were doing Kylie’s study!). These measures are a sum of binary yes/no measurements, i.e. do we get the prediction right or not? This means each validation compound can be thought of as a sample from a binomial distribution for the sensitivity or accuracy, and the confidence interval for the underlying parameter ‘p’ (in this case the sensitivity or accuracy themselves) has a nice simple analytic expression first worked out by Wilson in 1927. So this post is certainly nothing new for a statistician! But you very rarely see these confidence intervals presented alongside assessments of sensitivity, specificity or accuracy. Perhaps this is because the confidence intervals often look quite big?!?

So I’ve plotted the Wilson’s Score Interval on top of the sensitivity and accuracy graphs for the two permutations of the compounds in Figure 3, which also appears in the new review paper.

Fig. 3 nicely shows how the intervals narrow with increasing numbers of compounds, and also how the 95% confidence intervals do indeed include the N=77 performance evaluations (our best guess at the ‘real’ numbers) about 95% of the time. So now when we look at just 30 compounds we still get the 35% or 65% sensitivity estimates, but we know that the ‘real’ sensitivity measure could easily be somewhere in 12-63% for the blue compounds, or 35-85% for the red ones, and indeed the N=77 estimate is at 60% – within both of these. In fact, the confidence interval applies at N=77 too, so the ‘real’ value (if we kept evaluating more and more compounds) could still be anywhere in approximately 41-77%; the accuracy estimate has narrowed more, to around 63-82%.

Note that because the sensitivity calculation, shown in Fig. 1, effectively operates with a smaller sample size than the accuracy (just the experimental prolonging compounds instead of all of them) you get wider confidence intervals. Note this means you can’t tell a-priori what number of compounds you’ll need to evaluate, unless you have a fair idea of the positive/negative rate for the Gold Standard measurements.

If you are getting appropriately sceptical, you’ll now wonder why I only showed you two permutations of the compound ordering, isn’t this as bad as only looking at two compounds? Quite right! So here are 50 random permutations overlaid for Sensitivity, Specificity, Cohen’s Kappa value, and Accuracy:

## So, is testing 30 compounds enough?

Well, you can use as many or few compounds as you like, **as long as you evaluate the confidence in your prediction scores** (sensitivity, specificity, accuracy etc.)! This little exercise has persuaded me it’s vital to present confidence alongside the prediction scores themselves (something I’ve not always done myself). In terms of getting confidence intervals that are the kind of width we want for deciding whether or not predictions of QT prolongation are good enough (probably down to around 10%), this little exercise suggests we probably need to be doing validation studies on around 100 compounds.