I’ve just contributed to a new review ‘Recent developments in using mechanistic cardiac modelling for drug safety evaluation’, which discusses where we are with using mathematical models of the electrical activity of cardiac muscle cells to predict whether new drugs will cause any dangerous changes to heart rhythm. Quite a bit of the review brings up challenges that we’re going to come across as we assess whether or not the simulations are ready to help regulators decide on the safety of new pharmaceutical compounds.

## How do we quantify ‘predictive power’?

As part of this review I got thinking about how reliable our assessments of simulation predictions are. We often think about evaluating predictions in terms of binary (yes/no) outcomes: e.g. do we keep developing a drug or not; do we allow a drug onto the market or not? To look at predictions of this kind of thing we often use a *confusion matrix*, also known as a *contingency matrix/table*, as shown in Figure 1.
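To make the confusion-matrix scores concrete, here's a minimal sketch (illustrative Python with hypothetical function names, not code from the study) of how sensitivity, specificity and accuracy are read off the four boxes:

```python
def confusion_counts(predicted, actual):
    """Count the four confusion-matrix boxes from paired yes/no outcomes."""
    tp = sum(p and a for p, a in zip(predicted, actual))          # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))      # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))      # false negatives
    tn = sum(not p and not a for p, a in zip(predicted, actual))  # true negatives
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Standard prediction scores from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # of the truly positive, how many we caught
    specificity = tn / (tn + fp)   # of the truly negative, how many we cleared
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy
```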

In Kylie’s study evaluating simulation predictions of whether or not compounds caused 10% or more prolongation of the rabbit wedge QT interval, we looked at a decent number of compounds: 77 for simulations based on PatchXpress automated patch clamp technology (and even more for other sources of screening data). Here’s a slightly more elaborate confusion matrix from that study:

In Table 1 I separated out the predictions into three categories (>= 10% shortening, >= 10% prolongation, or ‘no effect’ in between the two), to allow us to evaluate predictions of shortening as well. But you just need to add up the relevant boxes as shown to evaluate predictions of prolongation. So here we do quite well, and it *feels* as if we’ve got enough compounds in the validation set to say something concrete.

But when we’re evaluating a new simulation’s performance, how many validation compounds do we need? Clearly more than 1 or 2, since you could easily get those right by accident, or by design. But are 10 validation compounds enough? Or do we need 20, 30, 50, the 77 we used here, 100, or even more?

## How does our evaluation depend on the number of validation compounds?

In Figure 2 I’ve plotted how our estimates of the prediction sensitivity and accuracy change (and converge to a roughly steady value) as we increase the number of validation compounds that we consider. Here the estimates happen to ‘level out’ at about N=60 compounds. But by simply permuting the order in which we consider these 77 compounds, and repeating the process, we get another possible graph (two permutations shown in Figure 2).
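The permutation exercise behind these graphs can be sketched like this (a toy example with made-up validation results, not the 77 compounds from the study):

```python
import random

def running_sensitivity(results, rng):
    """Running sensitivity estimate as validation compounds are added one by one.

    results: list of (predicted_prolonging, actually_prolonging) booleans,
    one pair per compound. The order is shuffled first, mimicking one
    permutation of the compound set.
    """
    order = list(results)
    rng.shuffle(order)
    tp = fn = 0
    estimates = []
    for predicted, actual in order:
        if actual:  # sensitivity only counts the truly prolonging compounds
            if predicted:
                tp += 1
            else:
                fn += 1
        estimates.append(tp / (tp + fn) if tp + fn else float("nan"))
    return estimates

# Made-up validation set: 6 prolonging compounds caught, 4 missed, 10 inactive.
fake_results = [(True, True)] * 6 + [(False, True)] * 4 + [(False, False)] * 10
trace_a = running_sensitivity(fake_results, random.Random(1))
trace_b = running_sensitivity(fake_results, random.Random(2))
```

The two traces wander along different paths early on, but by construction they must agree once every compound has been included, which is exactly the convergence behaviour in Figure 2.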

So if we didn’t have the big picture here, and only had 30 compounds to work with (solid vertical lines in Fig. 2), we could equally well conclude that the sensitivity of the simulation prediction is 35% (which you’d consider rubbish) or 65% (pretty good)! So we have a problem with N=30 validation compounds.

## Can we estimate our uncertainty in the performance of our model?

How do we know how far away the ‘real’ sensitivity and accuracy are likely to be? And by ‘real’ I mean the sensitivity and accuracy we would measure if we had an infinitely large set of validation compounds to test.

Luckily, the stats for this are well established (and Dave remembered them when we were doing Kylie’s study!). These measures are sums of binary yes/no measurements, i.e. do we get the prediction right or not? This means each validation compound can be thought of as a sample from a binomial distribution, and the confidence interval for the underlying parameter ‘p’ (in this case the sensitivity or accuracy itself) has a nice simple analytic expression first worked out by Wilson in 1927. So this post is certainly nothing new for a statistician! But you very rarely see these confidence intervals presented alongside assessments of sensitivity, specificity or accuracy. Perhaps this is because the confidence intervals often look quite big?!
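For the record, the Wilson score interval for k successes out of n trials has a simple closed form; here's a small sketch (assuming the usual z = 1.96 for an approximately 95% interval):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson (1927) score interval for an observed proportion k/n.

    k: number of correct predictions (or true positives, etc.)
    n: number of trials the score is based on
    z: normal quantile; 1.96 gives an approximately 95% interval.
    """
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                         + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width
```

For example, 15 correct predictions out of 30 gives an interval of roughly 0.33 to 0.67 — exactly the kind of width that makes a 30-compound validation set hard to interpret.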

So I’ve plotted the Wilson score interval on top of the sensitivity and accuracy graphs for the two permutations of the compounds in Figure 3, which also appears in the new review paper.

Fig. 3 nicely shows how the intervals narrow with increasing numbers of compounds, and also how the 95% confidence intervals do indeed include the N=77 performance evaluations (our best guess at the ‘real’ numbers) about 95% of the time. So now when we look at just 30 compounds we still get the 35% or 65% sensitivity estimates, but we know that the ‘real’ sensitivity measure could easily be somewhere in 12-63% for the blue compounds, or 35-85% for the red ones, and indeed the N=77 estimate is at 60% – within both of these. In fact, the confidence interval applies at N=77 too, so the ‘real’ value (if we kept evaluating more and more compounds) could still be anywhere in approximately 41-77%; the accuracy estimate has narrowed more, to around 63-82%.

Note that because the sensitivity calculation, shown in Fig. 1, effectively operates on a smaller sample than the accuracy (just the experimentally prolonging compounds rather than all of them), it gives wider confidence intervals. Note this means you can’t tell a-priori what number of compounds you’ll need to evaluate, unless you have a fair idea of the positive/negative rate for the Gold Standard measurements.

If you are getting appropriately sceptical, you’ll now be wondering why I only showed you two permutations of the compound ordering: isn’t that as bad as only looking at two compounds? Quite right! So here are 50 random permutations overlaid for Sensitivity, Specificity, Cohen’s Kappa, and Accuracy:

## So, is testing 30 compounds enough?

Well, you can use as many or as few compounds as you like, **as long as you evaluate the confidence in your prediction scores** (sensitivity, specificity, accuracy etc.)! This little exercise has persuaded me that it’s vital to present confidence intervals alongside the prediction scores themselves (something I’ve not always done myself). In terms of getting confidence intervals of the width we want for deciding whether or not predictions of QT prolongation are good enough (probably down to around 10%), this exercise suggests we probably need to be doing validation studies on around 100 compounds.

Nice blog entry. Surely it’s more than just the number of compounds but also how active they are. I could cook up 100 compounds that don’t do anything and build a model and show high predictivity with insanely narrow confidence intervals! You could argue that the data-sets where we have been getting good predictivity scores are because the pharmacological space being explored is narrow and easy, compare the dog v rabbit data-sets. A balanced data-set in terms of activity is surely also just as important as the number of compounds. Then again there is an argument that you want the frequency of the types of activity in your test-set to match what you have coming through the pipeline maybe? Or if you want your model to look good you just pick compounds that are easy and lots of them! If it’s the latter then I would consult CiPA!

Cheers Tarachopoiós!

You’re absolutely right – I tried to get that idea in with the “Note this means you can’t tell a-priori what number of compounds you’ll need to evaluate, unless you have a fair idea of the positive/negative rate for the Gold Standard measurements”, but might not have stressed it enough. Yes, I was thinking that the validation compounds should reflect the type of activity that’s representative of the compounds you’ll be predicting on ‘in production’. That’s certainly what we were doing in the example here from Kylie’s paper where they were real candidate compounds, rather than strongly [in-]active reference compounds.

By the way – I had to look it up, but I love the username…

Thank you for replying.

So in the end someone could argue that if you want your test set to reflect what’s coming through your pipeline then a me too approach (nearest neighbour say) to classifying compounds would suffice to solve this problem? I guess though pipelines may change over time: the chemical space being explored 5 years ago may not be the same one being explored today or is it? A me too approach may fail and so may more complex approaches, so does that mean you may want to re-test your models every few years?

I think a statistical ‘me too’ prediction model would work very nicely in one or two dimensions (proportion of hERG block and CaL block say), but if you want something that will work in higher dimensions of up to 5,6,7 ion channels, then the nearest neighbour might be a very long way away, and biophysically-based models should win! Not that we’re capturing kinetics as standard at the moment, but when we do, that will add many more parameters with non-linear effects, and nearest neighbour would really struggle…

All very good points Gary. Thank you for the interaction. It sounds like we need a prediction competition. Now that would be interesting!

Pingback: Uncertainty quantification for ion channel screening and risk prediction | Mathematical Matters of the Heart

Pingback: Arrhythmic risk: regression, single cell biophysics, or big tissue simulations? | Mathematical Matters of the Heart