Are 30 compounds enough to test our simulations?

I’ve just contributed to a new review ‘Recent developments in using mechanistic cardiac modelling for drug safety evaluation’, which discusses where we are with using mathematical models of the electrical activity of cardiac muscle cells to predict whether new drugs will cause any dangerous changes to heart rhythm. Quite a bit of the review brings up challenges that we’re going to come across as we assess whether or not the simulations are ready to help regulators decide on the safety of new pharmaceutical compounds.

How do we quantify ‘predictive power’?

As part of this review I got thinking about how reliable our assessments of simulation predictions are. We often think about evaluating predictions in terms of binary (yes/no) outcomes: e.g. do we keep developing a drug or not; do we allow a drug onto the market or not? To look at predictions of this kind of thing we often use a confusion matrix, also known as a contingency matrix/table, as shown in Figure 1.

Confusion Matrix

Figure 1: a ‘confusion matrix’, adapted from Wikipedia. For us the ‘Condition’ / Gold Standard would be the experimental measurement, and the ‘Test Outcome’ would be the simulation prediction.

In Kylie‘s study evaluating simulation predictions of whether or not compounds caused 10% or more prolongation of the rabbit wedge QT interval, we looked at a decent number of compounds, 77 for simulations based on PatchXpress automated patch clamp technology (and even more for other sources of screening data). Here’s a slightly more elaborate confusion matrix from that study:

A confusion matrix for our rabbit wedge study

Table 1: Simulation performance when evaluating predictions of whether or not a drug compound will cause over 10% prolongation of the ECG waveform in the rabbit wedge, the numbers are the numbers of validation compounds which fall into each box. Note this isn’t a standard confusion matrix, as we’ve got three categories instead of two, and I’ve got the rows and columns swapped relative to the one above(!) sorry. But you can see the calculations that give us a sensitivity, specificity and accuracy.

In Table 1 I separated out the predictions into three categories (>= 10% shortening, >= 10% prolongation, or ‘no effect’ as between the two), to allow us to evaluate predictions of shortening as well. But you just need to add up the relevant boxes as shown to evaluate predictions of prolongation. So here we do quite well, and it feels as if we’ve got enough compounds in the validation set to say something concrete.

But when we’re evaluating a new simulation’s performance, how many validation compounds do we need? Clearly more than 1 or 2, since you could easily get these right by accident, or design. But are 10 validation compounds enough? Or do we need 20?, 30?, 50?, 77 as we used here?, 100?, or even more?

How does our evaluation depend on the number of validation compounds?

In Figure 2 I’ve plotted how our estimates of the prediction sensitivity and accuracy change (and converge to a roughly steady value) as we increase the number of validation compounds that we consider. Here the estimates happen to ‘level out’ at about N=60 compounds. But by simply permuting the order in which we consider these 77 compounds, and repeating the process, we get another possible graph (two permutations shown in Figure 2).

graphs of sensitivity and accuracy converging with the number of compounds we validate against

Figure 2: sensitivity and accuracy with increasing numbers of compounds. Two random permutations of the 77 available compounds are considered.

So if we didn’t have the big picture here, and only had 30 compounds to work with (solid vertical lines in Fig. 2), we could equally well conclude that the sensitivity of the simulation prediction is 35% (which you’d consider rubbish) or 65% (pretty good)! So we have a problem with N=30 validation compounds.

Can we estimate our uncertainty in the performance of our model?

How do we know how far away the ‘real’ sensitivity and accuracy is likely to be? And by ‘real’ I mean the sensitivity and accuracy if we had an infinitely large set of validation compounds to test.

Luckily, the stats for this is well established (and Dave remembered it when we were doing Kylie’s study!). These measures are a sum of binary yes/no measurements, i.e. do we get the prediction right or not? This means each validation compound can be thought of as a sample from a binomial distribution for the sensitivity or accuracy, and the confidence interval for the underlying parameter ‘p’ (in this case the sensitivity or accuracy themselves) has a nice simple analytic expression first worked out by Wilson in 1927. So this post is certainly nothing new for a statistician! But you very rarely see these confidence intervals presented alongside assessments of sensitivity, specificity or accuracy. Perhaps this is because the confidence intervals often look quite big?!?

So I’ve plotted the Wilson’s Score Interval on top of the sensitivity and accuracy graphs for the two permutations of the compounds in Figure 3, which also appears in the new review paper.

Sensitivity and accuracy with increasing numbers of compounds. Two random permutations of the 77 available compounds are considered.

Figure 3: same as Figure 2, but this time with 95% confidence intervals overlaid, calculated using Wilson’s score interval.

Fig. 3 nicely shows how the intervals narrow with increasing numbers of compounds, and also how the 95% confidence intervals do indeed include the N=77 performance evaluations (our best guess at the ‘real’ numbers) about 95% of the time. So now when we look at just 30 compounds we still get the 35% or 65% sensitivity estimates, but we know that the ‘real’ sensitivity measure could easily be somewhere in 12-63% for the blue compounds, or 35-85% for the red ones, and indeed the N=77 estimate is at 60% – within both of these. In fact, the confidence interval applies at N=77 too, so the ‘real’ value (if we kept evaluating more and more compounds) could still be anywhere in approximately 41-77%; the accuracy estimate has narrowed more, to around 63-82%.

Note that because the sensitivity calculation, shown in Fig. 1, effectively operates with a smaller sample size than the accuracy (just the experimental prolonging compounds instead of all of them) you get wider confidence intervals. Note this means you can’t tell a-priori what number of compounds you’ll need to evaluate, unless you have a fair idea of the positive/negative rate for the Gold Standard measurements.

If you are getting appropriately sceptical, you’ll now wonder why I only showed you two permutations of the compound ordering, isn’t this as bad as only looking at two compounds? Quite right! So here are 50 random permutations overlaid for Sensitivity, Specificity, Cohen’s Kappa value, and Accuracy:

Sensitivity, Specificity, Kappa and Accuracy

Figure 4: 50 repeats of this permutation of compound ordering exercise, with black dots being the assessed predictivity index, and overlaid confidence intervals for each (click for a bigger version, sorry I couldn’t make MatLab play nicely with the sizes of the axes labels).

So, is testing 30 compounds enough?

Well, you can use as many or few compounds as you like, as long as you evaluate the confidence in your prediction scores (sensitivity, specificity, accuracy etc.)! This little exercise has persuaded me it’s vital to present confidence alongside the prediction scores themselves (something I’ve not always done myself). In terms of getting confidence intervals that are the kind of width we want for deciding whether or not predictions of QT prolongation are good enough (probably down to around 10%), this little exercise suggests we probably need to be doing validation studies on around 100 compounds.

Posted in Action Potential Models, Drug action, Ion Channel Models, Safety Pharmacology, Stats and Inference | Tagged , , , , | 7 Comments

Model Complexity

We’ve been thinking about how to parameterise/train/calibrate models, and also how to select an appropriate model to begin with. This raises all sorts of interesting questions on the details, but this post is just about setting out a big overview of how this sort of thing might work, in an ideal scenario.

We’ll go through the concepts with the help of the diagram I’ve drawn in Figure 1. We initially sketched this out for Kylie‘s thesis, she’s the one doing all the work on this for the hERG channel models at the moment, but it didn’t make the final cut there, so here it is instead.

A diagram showing how these concepts are related.

Figure 1: model complexity. The relationship between the complexity of your model (* relative to the available training data’s information content); the error you will get in calibration/training; the uncertainty in parameters (the inverse of ‘identifiability’ if you like, how ill-posed the parameter fitting problem is); and the fit that you will get when predicting validation data.

First off, if you have an overly simple model, sitting at the top of Figure 1, you probably won’t have much ambiguity in the parameter values it should take for the best fit to training data, but the fit to training data isn’t likely to be very good, and the prediction of new cases in validation is likely to be worse.

But, it is always possible to get a better fit to training data by increasing the complexity of your model (adding free parameters, extra equations etc.):

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”
John von Neumann

In statistical models/machine learning approaches this is called “overfitting”, but the same applies to adding lots of complexity (postulated reactions or ion channel states etc.) into biophysical models. For some examples of overfitting see a google image search on ‘overfitting’ to give you a good idea how it happens, and why it might lead to really bad predictions later on.

Figure 2: Overfitting. From Twitter, possibly attributable to @ResearchMark, but I couldn’t find the original source, sorry!

So you can make your model overly-complicated, but this is subtle: there’s the possibility that if you try and fit too many parameters to a dataset that isn’t “information rich” enough, you’ll also run into problems. I’ve shown an example of this in Figure 3, now we’re at the bottom of Figure 1, the model is too complex for the training data to constrain in panel A, and so gives rubbish predictions in panel B (and a whole range of predictions depending at random on our initial guess at the parameters! This is a bad thing). But if we change the training data to a more ‘information-rich’ protocol that allows us to identify the parameters better, then we can make good predictions and we are back in the ‘optimal’ setting. So “Model Complexity” really means ‘model complexity relative to a given training protocol’.

A diagram showing how fits can be perfect, but predictions don't need to be.

Figure 3: Predicting with too much uncertainty in parameters (click to see a bigger version). This is an example using synthetic data generated from the ten Tusscher IKr model, but that doesn’t really matter for now. Blue dots – data, Red lines – fits (in A and C) and predictions (in B and D). A & B are from one fitting procedure – fitting to data in A, and predicting data in B, repeated from 20 different initial guesses at parameters. You get 20 different optimal parameter sets, all of which give a great fit to training data (A), but then give 20 different predictions in validation data (B), with inset showing how bad it can be. Conversely, if we train to the data in C, 20 different guesses again, then all parameters converge to the same guess (there is low uncertainty in parameters), and all predict the same thing, the right behaviour, in D. For more info on this see Fink & Noble (2009), and Fink et al. (2011) where a version of this figure first appeared.

So that was an example of a model with too many free parameters to constrain in a certain situation. It is quite possible, and easy, to write down a model that has too many free parameters to constrain with any of your available data. Or you could even write down a whole range of models of high complexity that fit the data perfectly, but will lead to different predictions. What should you do in this case? I am still a fan of simplifying the model down to one with identifiable parameters*, and testing out the simplest model that’s a decent fit with identifiable parameters. I make that sound easy, but in general it isn’t.

So which model should you use? Well, things like Information Criteria will try and let you guess from just the training data fit and model complexity. But you’ve no guarantee that they are going to accurately predict the profile of your validation error on the right of Figure 4, although that is what they’re effectively trying to do. So it is absolutely crucial to do the last step of making some validation predictions.

Making validation predictions sounds simple enough, but opens up more questions: how do I choose which validation experiment to perform? If I choose one that is too simple/similar to the training/calibration experiment, then I could trick myself into thinking I’m doing really well (and kid myself that the green shape on the right is really the same shape as the red training triangle). Most of the literature will tell you to do something based on the ‘context of use’, that is, situations as similar as possible to where you want to use the model ‘for real’. e.g. if I want to use a model to predict cardiac effects at 1Hz, then test it out at just 1Hz. This makes perfect sense for statistical/machine learning models.

With biophysical models, I don’t think it is that simple, and as usual it probably depends on what you are making the model for. If you want to predict some behaviour in future, then yes, do the ‘context of use’ simulations to see how well you are likely to do. But you might want to know whether your biophysical model is a good description of the biophysics – perhaps the biophysics itself is what you are trying to learn about, or perhaps you can’t be sure what the ‘context of use’ is (voltage waveforms in arrhythmias?). The biophysics should hold – not only outside the training data regime – in any regime where the underlying assumptions you built in about physical processes hold. In this case, perhaps it makes sense to throw the wackiest validation protocol you can think of at the model, to see if it really does capture the underlying biophysics? So we’re trying that, and I’ll let you know how we get on!

You also have to avoid falling into the trap of using validation predictions to perform the model selection for you. Sounds like an obvious thing to do, but then you have turned your ‘validation data’ into ‘model selection training data’! This is easy to do by accident, and really if you do this you should have a third set of data – sometimes called ‘test data’ to see how well the calibrated and selected model performs. Often machine learning people do work with three datasets like this – see Chapter 7 of Hastie et al.’s Elements of Statistical Learning book for a good intro.

*As an aside, people often talk about a model being “identifiable” (i.e. “the parameters of the model are identifiable”). But we have to remember that only makes any sense as shorthand for the statement “the parameters of the model are identifiable, given this training/calibration data“. Which immediately raises the question of “What if we calibrate the parameters to the results of a different experiment?”, as the example in Figure 3 shows, which is getting me into the whole new field of optimal experimental design…

Posted in Action Potential Models, Ion Channel Models, Model Development, Stats and Inference | Tagged , , , , , , , , | Leave a comment

If computational biologists did astronomy…

Here is a cheeky post along the lines of Can a biologist fix a radio? (well worth a read). The discovery that Pluto wasn’t the size we expected got me thinking about how we would have tackled it. So in our case: Can a computational biologist do astronomy?

Our protagonists are a Mathematical / Computational Biologist MCB, and a Noteworthy Astronomy Scholar with Antiquated ideas, NASA. (If anyone’s initials are MCB, I forgot and genuinely didn’t mean you!)

MCB: I am going to make a model of how Pluto moves under gravity.

NASA: we know that Pluto is subject to gravitational forces, so its orbit depends on its total mass according to Newton’s law of universal gravitation.

MCB: Great, so if I could measure the density and radius of Pluto then I could work out its mass with M = density x volume, and then I could work out the gravitational forces on it.

NASA: Well, if we say it is a uniform sphere, then I suppose so yes.

MCB: Good. I will do some calculations…

MCB: I have varied the radius and density of Pluto and discovered that Pluto can have almost any orbit you like.

NASA: By the way, we have watched the orbit, and it is very consistent.

MCB: Interesting! I will do some more calculations to see which densities and radii it uses.

NASA: Er, we don’t really have any evidence that Pluto is varying its radius and density. It might make sense just to work with its total mass for now.

MCB: That’s all very well, but we don’t have any evidence that it isn’t changing its radius and density all the time do we?


MCB: In fact this would make sense, because Pluto’s orbit must be robust to fluctuations in radius and density that occur when space dust hits it stochastically. It probably varies them continuously to adapt to getting hotter when it is closer to the sun too, and all sorts of other things. I will do some more calculations…

MCB: The radius and density of Pluto could take almost any values and still give that orbit! Pluto is robust to changes in its radius and density [as long as they fall near a line that looks a bit like 1/density is proportional to radius^3].

NASA: [You’ll probably find that constant of proportionality is 4/3 pi], I think we might need new equipment to measure the radius and density, you should probably stick with a simple model with just total mass for now.

MCB: All the evidence we have is completely consistent with Pluto taking a wide range of radii and densities, I can make some interesting predictions for what happens to Pluto in new situations at different radii and densities and publish lots of papers about them.

NASA: Well it might take various values, but we’d better go and check.

[Some time later…]

NASA: We’ve just measured Pluto, its radius and density weren’t what we guessed, but they take constant values!

[The End]

In case you didn’t get my not-so-subtle point, I have read quite a few papers linking un-identifiability of parameters in inverse problems as plausibility that the parameters really do take a wide range of values, backed up with an argument that they must to confer biological robustness, or explain variability. Biological systems obviously do have to be robust to stochastic transcription, noise and a wide range of conditions. But lack of information about a quantity is not evidence that it varies in reality: certain things are very carefully controlled by numerous feedback mechanisms e.g. blood temperature. Don’t get me wrong, I think understanding variability and uncertainty is absolutely crucial. But if you are relying on something varying a lot for your results, I would say the onus is on you to design an experiment where you can measure it as well as possible, and do it lots of times to show how it varies.

Posted in Action Potential Models, Silly, Stats and Inference | Tagged , , , | Leave a comment

Job Advert

A quick shameless plug for a job we are advertising.

This is a 3 year postdoctoral position to work with me in the Computational Biology Group in the Department of Computer Science, University of Oxford. It would best suit someone with a PhD in a subject such as: Bayesian inference; model selection; model calibration; uncertainty quantification; or optimal experimental design.

The idea will be to apply these kind of techniques to various problems in electrophysiology modelling, to allow us to begin to assign probabilities to predictions. The post would suit someone who is looking to apply their statistical skills to real problems, and to learn about biophysically-based modelling and its applications for biological problems. No prior experience of modelling these kind of things is necessary, or even desirable!

More details and a link to the official Oxford job portal can be found on the CS news feed. Please contact me at for any informal enquiries about the post.

Posted in Future developments | Tagged , , , , , , | Leave a comment

Triangles and Conductances

Just a little post on something that’s really simple, not new, and might seem obvious to some people – but might help when choosing how to explore the effect of altering parameters on action potential model behaviour.

The reason I thought I’d post this is that people have expressed surprise that they don’t see symmetric sized effects on action potentials when they make a maximal conductance bigger or smaller. By which, I mean that adding/subtracting a constant percentage (e.g.) +/- 50% to a conductance doesn’t lead to the same ‘balanced’ changes in action potential properties. But I don’t think you should expect that, and why not is all to do with triangles!

Remembering our GCSE maths, the simple equation for a straight line is:

y(x) = mx + c. ...(1)

The equation of a straight line

Figure 1: The equation of a straight line

Where m is the gradient of the line, or dy/dx. If you want to consider where the triangle in Figure 1 hits the x-axis (y = 0), just re-arrange:

x = -c/m ...(2)

which is positive in Figure 1 since m is a negative gradient.

If you think of a cardiac action potential model, we usually write something that looks like

dV/dt = -(g*open_probability*driving_voltage + same for each other current)

where the ‘g’ is the maximum conductance of a current, proportional to the number of available ion channels in the cell membrane.

So thinking of the triangle as a simplified action potential*, the gradient is given by m = dV/dt and looking at the intercept with the x-axis as the action potential duration (APD), we would write (2) as

APD = -c/(dV/dt) = c/(g*stuff) ...(3)

(I will drop the ‘stuff’ for simplicity). From this simple equation, you can see that to double APD you would have to halve ‘g’, and to halve APD you would have to double ‘g’. This has some consequences for exploring model behaviour.

Two graphs of 1/x

Figure 2. Left: the effect of altering the gradient by up to + 80% (APD can be reduced by up to 45%), Right: the effect of altering the gradient by up to -80% (APD can increase by up to 500%). Red line is 1/x, black dashed line is the identity line y=x. You wouldn’t believe how long it took me to make matlab do that overlapping shading thing.

In Figure 2 I’ve highlighted a couple of things to note from equation (3). I’ve plotted ‘APD’ as a function of g (a plot of 1/x). If we increase g by 80%, by changing its value from 1 to 1.8, we can see the APD is reduced from 1 to 1/1.8, by about -45%. Decreasing ‘g‘ by 80%, from 1 to 0.2, has a much larger effect, with increase in APD from 1 to 1/0.2, a change of +500%.

A graph of 1/x with x=0.5 to 2 highlighted.

Figure 3: scaling a conductance by multiplying/dividing by the same factor allows you to explore the same changes to behaviour.

In Figure 3 I’ve showed an alternative way of exploring the space. By multiplying/dividing by a factor (of 2 in what I’ve shown) we get a symmetric plot, so we can see changes of 2x or 1/2 in APD by exploring the same range in conductance ‘g’.

By way of a more useful example I’ve varied the calcium current conductance g_CaL in the ten Tusscher 2006 (epi) model, and show the results in Figure 4.

ten Tusscher 2006 action potentials

Figure 4: An example with ten Tusscher 2006 (epi), in both plots black is the control AP, red is with decreased g_CaL and blue is with increased g_CaL. Left: g_CaL scaled by +/-50%. Right: g_CaL scaled by *2 or /2. Here, in contrast to my advice in this post, I’ve just looked at one pace of simulation to avoid too many other effects coming in to confound this!

Just as we expected from the triangle analogy, we get far more ‘symmetry’ in the response when scaling g_CaL rather than changing it by +/- a fixed percentage (not exactly symmetric because the nonlinear parts of the system come into play too, as different currents get affected by the alteration to voltage).

You might decide that biology has set up its machinery in such a way as to make changes in conductances of +/-50% more likely than +100%/-50%, so you are happy seeing the consequences of varying parameters like that, which is fine. But if you want to do an inverse problem, and see which parameters could describe a wide range of APDs, then I’d suggest it makes sense to explore the parameters by scaling instead of percentage changes.

*of course, this is pretending there is only one (constant) ion current. But as we can see in the example in Figure 4, the principle on the possible effects of conductance changes still applies with more currents. Incidentally, none of the above argument needs the triangle to be the whole action potential, if you zoom in on any little bit of action potential it will look like a triangle, and the same argument therefore applies to any action potential property you care to look at…

Posted in Action Potential Models, Drug action | Tagged , , , , , , | Leave a comment

Hypotheses on hypotheses on …

(Subtitle: building on foundations of unknown shakiness?)

In a lot of fields, a new mathematical model is constructed for each problem, and the model is an encoding of the quantitative rules of the hypothesis you are testing. The model is the hypothesis.

But in cardiac electrophysiology, we often use ‘off the shelf’ previously published models (which implicitly encode many ‘hypotheses’ about how each current depends on voltage etc. etc.). Then we use these models to test a completely different sort of hypothesis with a simulation study: perhaps at the cell scale, something like ‘alternans occur at a lower pacing rate when we change action potential model parameters to look like failing hearts’; or perhaps at the tissue scale e.g. ‘discontinuities in tissue properties make spiral waves more likely to break up’. I’ll call these examples of a secondary hypothesis.

The thing is, the secondary hypothesis is usually stated as the sole hypothesis that a particular study/article/paper is testing. Even if we decide that the secondary hypothesis is true in our simulation study, that finding is completely reliant on the underlying primary hypotheses all being true too. These might not ever have been tested (especially because we probably haven’t unpicked exactly what behaviours at the primary level we rely on to see the result at the secondary level).

No answers in this post on how we should deal with this! Just a thought that we need to acknowledge the primary hypotheses are still just hypotheses, there may be competing ones, and it would be nice to figure out (and record!) which ones we’re relying on in which ways… Until we can do this in a sensible way, multiscale model building in a plug-and-play way is going to be… er… hard!

Edit: I first wrote this draft quite a long time ago, I think some uncertainty quantification at the level of the primary hypotheses might be the way to go (assigning probabilities to them, and examining the consequences for secondary hypotheses), but open for comments below!

Posted in Action Potential Models, Future developments, Tissue Simulations | Tagged , , , , , | Leave a comment

Cyclamen purpurascens

Alex Hyde is a friend of mine and a professional photographer, so I put him on a “Yorkshireman’s commission” as he called it (two pints of beer), to capture me something heart-related for the blog header image.

So here it is, and considering he had to go all the way to some woods in Croatia to get it, a bargain I think. It is an Alpine Cyclamen or Cyclamen purpurascens.

Alpine Cyclamen

Alpine cyclamen photographed by Alex Hyde

Have a look at Alex’s website and follow him on Twitter – particular specialities are amazing close-ups of bugs and camouflage shots. Buy some prints for your friends!

Posted in Blog | Tagged | Leave a comment