Arrhythmic risk: regression, single cell biophysics, or big tissue simulations?

On the 15th March I presented at an FDA public Advisory Committee hearing on the proposals of the CiPA initative. CiPA aims to replace the current testing for increased drug-induced Torsade de Pointes (TdP) arrhythmic risk, which is a human clinical trial, with earlier pre-clinical testing (including mathematical modelling) that could give a more accurate assessment without the need for a human trial.  Most of my talk (available on Figshare) was about the rationale for using a biophysically-based mechanistic model to classify novel compounds’ TdP risk, the history of cardiac modelling, and how simulations might fit into the proposals.

The advisory committee asked some great questions, and I thought it was worth elaborating on one of my answers here. To summarise, quite a few of their questions came down to “Why don’t you include more detail of known risk factors?”. Things they brought up include:

  • long-short-long pacing intervals is often observed in the clinic prior to Torsade de Pointes starting, why not include that?;
  • should we model Purkinje cells rather than ventricular cells (perhaps ectopic beats or arrhythmias arise in the Purkinje fibre system)?;
  • heart failure is a known risk factor – would modelling these conditions help?

Mechanistic markers

Before answering, it’s worth considering where we are now in terms of ‘mechanistic’ markers of arrhythmic risk. Figure 1 shows how things are assessed at the moment.

Fig 1: block of hERG -> prolongation of action potential duration (APD) -> prolongation of QT interval on the body surface ECG. Taken from my 2012 BJP review paper.

It was observed that the risky drugs withdrawn from the market in the late 90s prolonged the QT interval, and for these compounds this was nicely mechanistically linked to block of the hERG channel/IKr (top panel of Fig 1). This all makes nice mechanistic sense – a prolonged QT is related to delayed repolarisation (as in bottom of Fig 1), which in turn is related to block of hERG/IKr.

There’s a couple of reasons prolonged repolarisation is conceptually/mechanistically linked to arrhythmia. Firstly, if you delayed repolarisation ‘a bit more’ (continue to decrease the slope at the end of the action potential – middle panel of Fig 1), you’d get repolarisation failure, or after-depolarisations. Secondly, by delaying repolarisation you may cause regions of tissue to fail to be ready for the following wave, termed ‘functional block’.

As a result of the pathway from hERG block to QT prolongation, early ion channel screening focusses on checking compounds don’t block hERG/IKr. The clinical trials tried to avoid directly causing arrhythmias, for obvious reasons, but by looking for QT prolongation in healthy volunteers you would hopefully spot compounds that could have a propensity to cause arrhythmias in unhealthy heart tissue, people with ion channel mutations, people on co-medication with other slightly risky compounds, or other risk factors. This has been remarkably successful, and there have very few surprises of TdP-inducing compounds sneaking past the QT check without being spotted.

But, there were some hERG blockers on the market that didn’t seem to cause arrhythmias. Our 2011 paper showed why that can happen –  there are different mechanistic routes to get the same QT or APD changes (by blocking multiple ion channels rather than just hERG) and if you took multiple ion channel block into account you would get better predictions of risk than simply using the early hERG screening results. So multiple ion channel simulations of single cell APD are a very similar idea to clinical trials of QT (and comparing the two is a good check that we roughly understand a compound’s effects on ion channels).

So clinical QT/simulated APD is a mechanistically-based marker of arrhythmic risk, but we know it still isn’t perfect because some drugs with similar QT prolongation in our healthy volunteers have different arrhythmic risks (see CiPA papers for an intro).

One extreme: as detailed as possible

At one end of the scale, some studies advocate whole-organ simulations of TdP in action to assess TdP risk. Here’s a video of the impressive UT-heart simulator from Tokyo that was used in that study.

These simulations definitely have their place in helping us understand the origin of TdP, how it is maintained/terminates, and possibly helping design clinical interventions to deal with it. If we want to go the whole hog and really assess TdP risk completely mechanistically why not do patient-specific whole organ TdP simulations, with mechanics, individualised electrophysiology models, all the known risk factors, the right concentrations of compound, and variations of these throughout the heart tissue, etc. etc.?

Let’s imagine for a minute that we could do that, and got a model that was very realistic for individual patients, and we could run simulations in lots of different patients in different disease states, and observe spontaneous initiation of drug-induced TdP via the ‘correct’ mechanisms (this hypothetical situation ignores the not-inconsiderable extra uncertainties in how well we model regional changes in electrophysiology, what changes between individuals, blood pressure regulation models, individual fibre directions, accurate geometry, etc. etc. etc. which might mean we get more detail but less realism than a single cell simulation!). Let’s also say we could get these huge simulations to run in real time – I think the IBM Cardioid software is about the fastest, and goes at about three times less than real time on the USA’s biggest supercomputer.

That would be brilliant for risk assessment wouldn’t it?

Unfortunately not!

TdP is very rare – perhaps occurring once in 10,000 patient-years of dosing for something like methadone. Which means ultra-realistic simulations would have to run for 10,000 years just to get an N=1 on the world’s biggest supercomputer! It’s going to be quite a while before Moore’s law makes this feasible…

Inevitably then, we are not looking to model all the details of the actual circumstances in which TdP arises, we’re looking for some marker that correlates well with the risk of TdP.

The other extreme: as simple as possible

The other extreme is perhaps forgetting about mechanism altogether, and simply using a statistical model, based on historical correlations of IC50s with TdP, to assess the risk. Hitesh Mistry did something a bit like this in this paper (although as I’ve said at conferences – it’s not really a simple statistical correlation model, it’s really a clever minimal biophysically based model, since it uses Hill equation and balance of depolarising and repolarising currents!). But for two or three ion channel block it works very well.

Why would I like something a bit more mechanistic than that then? I came up with the example in Figure 2 to explain why in the FDA hearing.


Fig 2: imagine dropping a mass of 1kg off a tower and timing how long it takes to fall to the ground. You might get the blue dots if you did it from between the second and third floors of a building. Left: if you were a statistician only you might then do a linear regression to estimate the relationship between height and time (think ion channel block and TdP risk). Right: a physicist would get out Newton’s II law and derive the relationship on the right. The one on the left would be dodgy for extrapolating outside the ‘training data’, the one on the right would be fairly reliable for extrapolating (not completely – as it doesn’t include air resistance etc.!)

Why might ion channel block be like Fig 2? Well when you’re considering just two or three channels being blocked, then Hitesh’s method (which actually includes a bit of mechanistic Newton’s II law in my analogy!) will work very well, assuming it’s trained on enough compounds of various degrees of block of the three channels, as the blue dots will cover most of the space.

But you might want to predict the outcome of block (or even activation) up to seven or more different ionic currents (and combinations thereof) that could theoretically happen and cause changes relevant to TdP risk. In this case, any method that is primarily based on a statistical regression, rather than mechanistic biophysics, is going to struggle because of the curse of dimensionality. In essence, you’ll struggle to get enough historical compounds to ‘fill the space’ for anything higher than two or three variables/ion currents. You could think of the biophysical models as reducing the dimension of the problem here (in the same way as the biology does, if we’ve got enough bits of the models good enough), so they can output a single risk marker that is then suitable for this historical correlation with risk – without a huge number of compounds.

The right balance?

CiPA is pursuing single-cell action potential simulations, looking for markers of arrhythmic risk in terms of quantifying ‘repolarisation stability’ in some sense. I think this is a very sensible approach, geared simply at improving one step on solely APD/QT.

In terms of including more risk factor details in here, as the committee asked originally at the top of this post, the real question is ‘does it improve my risk prediction?‘ or not. Hopefully I’ve explained why including all the detail we can think of isn’t obviously going to help. Your ranking of a new compound in terms of the risk of established ones would have to change in order for a new simulated risk marker to make any difference.

To assess whether that difference really was an improved risk prediction we would need to have faith that risk factors were widely applicable to the people who are getting TdP, and that any changes to the models for introducing heart failure etc. are sufficiently well validated and trusted to rely on them. I don’t think we are quite ready for this, as there is plenty to do at the moment trying to ensure there is an appropriate balance of currents in a baseline model (before any are blocked – two relevant papers: paper 1, paper 2), and that kinetics of drug-block of hERG included, as these are probably important.

Another thought along the committee’s lines is TdP risk for different patient subgroups, instead of a one-size-fits-all approach. This would be very nice, but the same difficulties apply, multiplied! Firstly, getting models that we trust for all these subgroups, with well quantified levels/distributions of ion channel expression and other risk-factor-induced changes. Secondly, even sparser gold standard clinical risk categorisation for all subgroups to test our models on. Unfortunately, with such a rare side effect it is difficult enough to get an overall risk level, never mind risk tailored to individual subgroups. So at present, I think the CiPA proposal of a single cell model (give or take an additional stem-cell derived myocyte prediction perhaps!) and single risk marker is a very sensible first step.

As usual, comments welcome below!

Posted in Action Potential Models, Drug action, Future developments, Model Development, Safety Pharmacology, Tissue Simulations | Tagged , , , , , , , | 1 Comment

Uncertainty quantification for ion channel screening and risk prediction

This post accompanies our new paper in Wellcome Open Research.

Regular readers of this blog will know that I worry about uncertainty in the numbers we are using to model drug action on electrophysiology quite a lot – see our recent white paper for an intro where we discuss the various sources of uncertainty in our simulations, and a 2013 paper on variability observed when you repeat ion channel screening experiments lots of times. That paper studied variability in the averages of lots of experiments, in contrast, it is also possible to look at the uncertainty that remains when you just do one experiment (or one set of experiments).

We have been looking at screening data that were published recently by Crumb et al. (2016). This is an exciting dataset because it covers an unprecedented number of ion currents (7), for a good number of compounds (30 – a fair number anyway!). The ideal scenario for the CiPA initative is that we can feed these data into a mathematical model, and classify compounds as high/intermediate/low arrhythmic risk as a result. Of course, the model simulation results are only ever going to be as accurate as the data going in, I’ve sketched what I mean by this in the plot below.

Uncertainty Quantification schematic plot

Figure 1: Uncertainty Quantification. Here we imagine we have two inputs (red, blue) to a nonlinear model, and we observe the resulting model outputs. Top: for just a single value (with no associated uncertainty). Middle: for two well-characterised (low uncertainty) inputs, these give us distinct probabilistic model predictions. Bottom: for two uncertain and overlapping inputs, inevitably giving us uncertain and overlapping output distributions.

In Figure 1 we can see the same scenario viewed three different ways. We have two inputs into the simulation – a ‘red’ one and a ‘blue’ one. You could think about these as any numbers, e.g. “% block of an ion current”.

Top row of Fig 1

If we ignored uncertainty, we might do what I’ve shown on the top row: plug in single values; and get out single outputs. Note that blue was higher than red and gives a correspondingly higher model output/prediction. But how certain were we that red and blue took those values? How certain are we that the output really is higher for blue?

Middle row of Fig 1

Instead of just the most likely values, it is helpful to think of probability distributions for red and blue, as I’ve shown in the middle row of Figure 1. Here, we aren’t quite sure of their exact values, but we are fairly certain that they don’t overlap (this step of working out distributions on inputs is called “uncertainty characterisation“), so their most likely values are the same as the ones we used before in the top row. When we map these through our model* (a step called “uncertainty propagation“) we get two probability distributions. N.B. these output distributions are not necessarily the same shape as the ones that went in, for a nonlinear model. In the sketch the output distributions don’t overlap much either. But this is only guaranteed to be true if the model output is a linear (with non-zero slope!) function of one input; in general, the output distributions could easily overlap more than the inputs [e.g. imagine the model output is a sine wave function of the input with red input = blue input + 2π]; which is why uncertainty propagation is an important exercise! The whole process of uncertainty characterisation and propagation is known as uncertainty quantification.

Bottom row of Fig 1

In the bottom row of Figure 1 we see another scenario – here there is lots of overlap between the red and blue input distributions. So we think blue is higher than red, but it could easily be the other way around. Now we map these through our model, and get correspondingly large and overlapping probability distributions on our outputs. Even in the best-case scenario (i.e. a linear model output depending solely on this input), we would end up with the same distribution overlap as the inputs; and for non-linear models, where outputs are complicated functions of many inputs, the distributions could overlap much more. So bear in mind that it isn’t possible to get less-overlapping distributions out of the model than the ones that went in, but it is possible to get more overlapping ones. The uncertainty always increases if anything (I hadn’t really thought about it before, but that might be related to why we think about entropy in information theory?).

If we consider now that the output is something like ‘safety risk prediction for a drug’, could we distinguish whether red or blue is a more dangerous drug? Well, maybe we’ll be OK with something like the picture in the middle row of Figure 1. Here my imaginary risk indicator model output distinguishes between the red and blue compounds quite well. But we can state categorically “no” if the inputs overlap to the extent that they do in the bottom row, before we even feed them in to a simulation. So we thought it would be important to do uncertainty characterisation and work out our uncertainty in ion channel block that we have from dose-response curves; before doing action potential simulations. This is the focus of our recent paper in Wellcome Open Research**.

Uncertainty Characterisation

In the new paper, Ross has developed a Bayesian inference method and code for inferring probability distributions for pIC50 values and Hill coefficients.

The basic idea is shown in Figure 2, where we compare the approach of fitting a single vs. distribution of dose-response curves:

Figure 2: maximum likelihood best fit versus inference of dose-response curves.

Figure 2: best fit versus inference of dose-response curves. On the right we have plotted lots of samples from the distribution of possible curves, with each sample equally likely, so the darker regions are the more likely regions. Data from Crumb et al. (2016).

On the left of Figure 2 we see the usual approach that’s taken in the literature, fitting a line of best fit through all the data points. On the right, we plot samples of dose-response curves that may have given rise to these measurements.

The method works by inferring the pIC50 and Hill coefficients that describe the curve, but also the observation error that is present in the experiment. i.e the ‘statistical model’ is:

data = curve(pIC50, Hill) + Normal noise (mean = 0, standard deviation = σ).

One of the nice features of using a statistical model like this is that it tries to learn how much noise is on the data by looking at how noisy the data are, and therefore generates dose-response curves spread appropriately for this dataset, rather than with a fixed number for the standard deviation of the observational noise.

At this point we calculate what the % block of the current (inputs into model) might look like from these inferred curves, this is shown in Figure 3:

Input distribution characterisation

Figure 3: uncertainty characterisation. When we go to do a simulation we’ll want the “% hERG block” as an input into the simulations. This figure shows how we work that out at a given concentration. Imagine drawing a line vertically in the top plot (same as right plot in Fig 2), now we just look at the distribution along this line, which is displayed in the bottom plot for two concentrations (green and pink-ish).

The new paper also extends this approach to hierarchical fitting (saying data from each experiment were generated with different pIC50 and Hill coefficients), but I will let you read the paper for more information on that.

So what can we conclude about the distribution of inputs for the Crumb dataset? Well, that’s a little bit tricky since it’s 7 channels and therefore a seven dimensional thing to examine. To have a look at it we took 500 samples of each of the seven ion current % blocks. This gives a table where each row looks like:

Drug (1-30) | Sample (1-500) | % block current 1 | % block current 2 | ... | % block current 7 |

So the seven % blocks are the seven axes or dimensions of our input dataset.

I then simply did a principal components analysis that will separate out the data points as much as possible by projecting the data onto new axes that are linear combinations of the old ones. You can easily visualize up to 3D, as shown in the video below.

In the video above you see the samples for each compound plotted in a different colour (the PCA wasn’t told about the different compounds). So each cloud is in a position that is determined by what % block of each channel the compounds produce at their free Cmax concentrations (given in the Crumb paper). What we see is that about 10 to 12 of the compounds are in distinct distributions, so they block currents in a combination unlike other compounds, i.e. behave like the different inputs in the middle row of Figure 1. But the remaining ones seem to block in distributions that could easily overlap, like the inputs in the bottom row of Figure 1. As you might expect, this is clustered close to the origin (no block) co-ordinate.

Here, these first three principal components happen to describe 94% of the variation in the full 7D dataset, so whilst it is possible that there is some more discrimination between the ‘clouds’ of samples for each compound in higher dimensions, your first guess would be that this is not likely (but this is a big caveat that needs exploring futher before reaching firm conclusions). So it isn’t looking promising that there is enough information here, in the way we’ve used it anyway, to distinguish between the input values for half of these compounds.

Uncertainty Propagation

But once you’ve got this collection of inputs you can simply do the uncertainty propagation step anyway and see what comes out. We did this for a simple O’Hara 1Hz steady state pacing experiment, applying conductance block according to the video’s cloud values for each of the 30 compounds, to work out predicted output distributions of APD90 for each compound. The results are shown in Figure 4.

APD90 distributions

Figure 4: distributions of simulated APD90 in the O’Hara model (endocardial variant) at free Cmax for all 30 compounds in the Crumb dataset, using samples from the inferred dose-response curves (Figure 15 in the paper). Random colours for each of the 30 compounds.

Figure 4 is in line with what we expected from the video above, so it causes us to have some pause for thought. Perhaps five compounds have distinctly ‘long’ APD, and perhaps two have distinctly ‘short’ APD, but this leaves 23 compounds with very much overlapping ‘nothing-or-slight prolongation’ distributions. The arrhythmic risk associated with each of these compounds is not entirely clear (to me!) at present, so it is possible that we could distinguish some of them, but it is looking a bit as if we are in the scenario shown in the bottom right of Figure 1 – and this output overlaps to such an extent that it’s going to be difficult to say much.

So we’re left with a few options:

  • Do more experiments (the larger the number of data points, the smaller the uncertainty in the dose-response curves), whether narrower distributions allow us to classify the compounds according to risk remains to be seen.
  • Examine whether we actually need all seven ion currents as inputs (perhaps some are adding noise rather than useful information on arrhythmic risk).
  • Get more input data that might help to distinguish between compounds – the number one candidate here would be data on the kinetics of drug binding to hERG.
  • Examine other possible outputs (not just APD) in the hope they distinguish drugs more, but in light of the close-ness of the input distributions, this is perhaps unlikely.

So to sum up, it is really useful to do the uncertainty characterisation step to check that you aren’t about to attempt to do something impossible or extremely fragile. I think we’ve done it ‘right’ for the Crumb dataset, and it suggests that we will need more, or less(!), or different data to distinguish pro-arrhythmic risk. Comments welcome below…


* The simplest way to do this is to take a random sample of the input probability distribution, run a simulation with that value to get an output sample, then repeat lots and lots of times to build up a distribution of outputs. This is what’s known as a Monte Carlo method (using random sampling to do something that in principle is deterministic!). There are some cleverer ways of doing it faster in certain cases, but we didn’t use them here!

** By the way: Wellcome Open Research is a new open-access, open-data, journal for Wellcome Trust funded researchers to publish any of their findings rapidly. It is a pre-print, post-publication review journal, so reviewers are invited after the manuscript is published and asked to help the authors revise a new version. The whole review process is online and open, and it follows the model. So something I’m happy we’ve supported by being in the first issue with this work.

Posted in Drug action, Model Development, Safety Pharmacology, Stats and Inference | Tagged , , , , , , , , , | 4 Comments

Ablation, scars and insights from parameter lumping

This is quite a long post elaborating on a bit of work I’ve been doing with Greg Morley and his group, which was published in Scientific Reports. I hope it provides an accessible introduction to a fairly complicated set of experiments and simulations.


Ablation is a treatment for some abnormal heart rhythms, which works by killing areas of the heart that we think are misbehaving and giving rise to abnormal rhythms. The idea behind ablation is that if you identify and kill a region of tissue that is allowing electrical signals to propagate ‘the wrong way’, then afterwards you will have created a non-conducting wall of scar tissue that prevents any abnormal electrical activity. As a treatment strategy, there’s clearly room for improvement; I hope we’ll soon look back on it as Dr McCoy does for dialysis in the scene in Star Trek IV. But at the moment, it’s the best we’ve got for certain conditions.

Usually, and ideally, you stop ablating when the arrhythmia terminates, so it is very effective in the short term. Unfortunately the success rate in the long term isn’t great – after a single round of atrial ablation people stay (for at least 3 years) arrhythmia free in only about 50% of cases1. So why do people re-lapse after ablation? Well it can be because electrical waves somehow get through the scar regions (see 1 for further references to this).

You also get similar scars forming after a heart attack – in this case the cardiac muscle itself doesn’t get a blood supply, and dies because of that, leaving a scar region. This damage leads to scars with larger border zones than in ablation, as in these border zones have low but not-zero blood supply, so you get semi-functioning tissue where some of the muscle cells (myocytes) survive, and others don’t. The problem now is the opposite to ablation – you’d like to restore conduction in these regions to get the rest of the heart beating as well as it can, but electrical activity is disturbed by the presence of these scar regions and their border zones, and arrhythmias become more likely (or even occur instantly in the case of big heart attacks).

Greg works on optical mapping for whole hearts in various species to study what happens around these scars. This is a clever technique where you put a dye into a tissue, it sticks to the membrane of cells, and when excited by an external light source, it emits light at an intensity dependent on the voltage across the cell membrane. So you can excite the dye with light at a certain wavelength, and record the light emitted at another wavelength in a camera to create a map of electrical activity in your sample, and you can do this in real time to see how electrical waves move across the heart. Here’s a simulation of electrical activity that Pras did with our Chaste software, and it includes a comparison with optical mapping at 39 seconds in the video below (this is for a completely different situation – just to show you how optical mapping works!):

Experimental Results

Greg found that in some intact hearts from mice recovering from ablation he could observe the optical mapping signal appearing to go straight through the middle of the scars. This is unexpected, to say the least, and required a lot of further investigation. I’ll let you read the full story in the paper, but suffice to say we confirmed that:

  • there aren’t any myocytes (normal heart muscle cells) left in the scars (shown by doing histology and using an electron microscope) and there are lots of what are largely fibroblasts left there, the cells in the scars are definitely ‘non myocytes’ anyway;
  • the observations aren’t just optical mapping artefacts (shown by doing direct micro-electrode recordings, showing that local electrical pulses from a suction electrode diffuse in, and even putting black foil in the middle of the hearts!);
  • the observations depend on electrical coupling between myocytes and fibroblasts, as it goes away if you don’t have this (shown by specifically breeding a strain of mice where you could knock out Connexin43 in just the non myocytes – clever stuff!). In fibroblasts, Connexin 43 is thought to form gap junctions electrically coupling them to only myocytes, not to other fibroblasts. If you knock out Connexin43 proteins in the non myocytes with this special mouse strain, then the optical mapping signal you record is much smaller in the scar region; the electrical signal no longer diffuses in from a suction electrode, indicating that fibroblast-myocyte junctions are required to see conduction into the scar.*

To quote from the article, this is “the first direct evidence in the intact heart that the cells in myocardial scar tissue are electrically coupled to the surrounding myocardium and can support the passive conduction of action potentials.”

Where mathematical modelling comes in

When Greg presented the experimental results at conferences he had a hard time convincing people that the experimental results were real, and not some odd artefacts of the experimental set up! It seemed that nobody expected electrical waves to travel through these regions, even if they were populated with cells, as these cells weren’t cardiac myocytes. It was counter-intuitive to most cardiac electrophysiologists that any signal would get across this gap, because that’s the whole point of ablation! This is where we thought it would be quite interesting to see what you would expect, from the standard model of electrical conduction, if there are neutral non-myocyte cells in a lesion that are coupled together. By neutral we mean not providing any ‘active’ transmembrane currents, and just providing passive electrical resistance.

Our standard simplest model for the reaction-diffusion of voltage in cardiac tissue is the monodomain equation (so-called because there’s an two-domain extension called the bidomain equation):

MonodomainThere are a few terms here that need defining: V is voltage across the cell membrane – which is the quantity that we consider to be diffusing around, rather than the individual types of ions themselves. Iion is the current across a unit surface area of the cell membrane at any point in space (varies across space), Istim is any external stimulus current which is applied to a unit volume of the tissue, Cm is the capacitance of a unit surface area of the membrane (the bigger this is, the more current is required to charge up a bit of membrane the same amount), and σ is the conductivity of the tissue – how easy it is for voltage to diffuse (expressed as a vector as there can be preferential conduction in different directions). χ is a scaling factor of surface-area-of-membrane-per-unit-volume-of-tissue, required to change the current across a unit-of-surface-area membrane into current across all-the-surface-area-of-membrane-in-a-unit-volume of tissue. So you can think of χ as the amount of membrane packed into a little cube of the tissue.

So, what will be different in the lesion?:

  • 𝜎 , the cell-cell conductivity, could be varying. This would represent how much Connexin was present to link the lesion cells together with gap junctions.
  • 𝜒 , the surface area of membrane in a unit volume of tissue, could also be varying. This would represent non-myocyte membrane density in the lesion.
  • 𝐶𝑚, the capacitance of a unit of membrane area, probably won’t change (membrane made of same stuff). Beware though, a maths/biology language barrier means experimentalists might call 𝜒 ‘capacitance’ to confuse us all!
  • Iion will be small compared to myocytes, these cells aren’t actively trying to propagate electrical signals. So we did the study twice, once with this set to zero in the scar, and once with it set to use a previously published model of fibroblast electrophysiology (didn’t make any visual difference to the results).

Now in the scar region we aren’t applying an external stimulus so Istim = 0, and we can divide through by χ to get:

Monodomain2A nice property springs out – that the entire behaviour of the scar region (in terms of difference to the rest of the heart) is determined by the ratio of 𝜎. So we introduce a factor ρ that will scale this quantity.

Even before we run any simulations, we’ve learnt something here by writing down the model and doing this “parameter lumping” (see nondimensionalisation for a framework in which to do this kind of thing rigorously!). Just by looking at ρ = 1 we see that there are an infinite number of ways we could get exactly the same behaviour. The scar cells could be just as well coupled (𝜎) and just as densely packed (χ) as the rest of the heart (incredibly unlikely to Greg), or they could be 1% as well coupled with 1% as much membrane present (more plausible to Greg); before we even do a simulation we can state definitively that this would give exactly the same behaviour: as ρ = 1 in both cases, and we’re solving the same equations! So this scaling is interesting, as we didn’t know whether the value of ρ would be increased or decreased in the scar, despite strong suspicion that the cells in the scar will have both reduced gap junctions and reduced membrane surface area available.

So we did a simulation with a mouse ventricular myocyte model, on a realistic sized bit of tissue 5mm x 5mm, with a 2mm diameter lesion in the middle. There are no preferential fibre directions here, something that could easily make the conduction more or less likely.

So what behaviours did we predict with different values of ρ? First for ρ = 1; the 𝜎/χ takes the same value as usual case (even though both could be reduced/increased by any factor):

In the video we see that conduction does indeed ‘get across’ the scar, even though it is not ‘driven’ as such as it is in the rest of the tissue, but instead is simply diffusing across that region. We predict the wave will get across a 2mm scar easily with this value of ρ.

What about with more membrane (or less conductivity): ρ = 0.1 (Remembering this would happen whenever 𝜎/χ is ten times less that the normal muscle: so both when 𝜎 is 1% of normal and χ is 10% of normal; or when 𝜎 is 0.1% of normal and χ is 1% of normal, for instance).

This time, you can see that the wave struggles to get across the lesion, but the membrane voltage still gets high, and that would probably still be high enough to record in optical mapping.

With less membrane density relative to the well-coupled-ness of the lesion cells (ρ = 10)? (Remembering this would happen whenever 𝜎/χ is ten times more that the normal muscle: so both when 𝜎 is 10% of normal and χ is 1% of normal; or when 𝜎 is 1% of normal and χ is 0.1% of normal, for instance). We see that the wave even appears to accelerate across the lesion region and conduction sets off earlier than before at the far side, as there is less membrane to charge so it is easier to do it:

So our conclusion was that it’s perfectly possible that a voltage signal could be recorded in the lesion, and a voltage signal could effectively travel straight through the scar, and conduction carry on out the other side. We’re not entirely sure about the value that ρ should take – but this behaviour was fairly robust, and matched what we saw in the experiments.

This simple model predicted many of the features of the recordings that were made from the scar region – see Figure 7 in the paper, and compare with experiments in Figure 4. So, it helped Greg answer accusations of “This is impossible!” that he got when he presented stuff at conferences, as he could reply that “If the cells have gap junctions, even without any voltage-gated ion channels of their own, this is exactly what we’d expect – see!”.

As usual, all the code is up on the Chaste website if anyone wants to have more of a play with this.


1 “Long-term Outcomes of Catheter Ablation of Atrial Fibrillation: A Systematic Review and Meta-analysis” Ganesan et al. J Am Heart Assoc.(2013)2:e004549 doi:10.1161/JAHA.112.004549

*Incidentally, I’d like to do a sociology experiment where I give biologists and mathematicians  logic problems to do mentally. My hypothesis is that biologists would beat mathematicians, as they are always carrying around at least ten “if this then that” assumptions in their heads. Then I’d give them a pen and paper and see if the situation reversed…

Posted in Action Potential Models, Future developments, Model Development, Tissue Simulations | Tagged , , , , , , , , , , | Leave a comment

Should I work with IC50s or pIC50s?

When you are expressing how much a drug inhibits something, it’s common to fit a Hill curve through a graph of concentration against % inhibition as shown here:


Figure 1: A concentration-effect curve showing % inhibition as a function of drug concentration. The Inhibitory Concentration 50% value is simply defined as the concentration that causes 50% block, here I’ve expressed it as the ratio of the drug concentration over IC50, so IC50 is defined to be 1 here. Notice we’ve plotted this on a log scale, which perhaps answers the question in the title already…

In our case this is often ‘% inhibition’ for a given ionic current, a consequence of a drug molecule binding to, and blocking, ion channels of a particular type on a cell’s membrane.

A while back we were interested what the distribution of IC50s would be if you repeated an experiment lots and lots of times. We asked AstraZeneca and GlaxoSmithKline if they had ever done this, and it turned out both companies had hundreds, if not thousands, of repeats as they did positive controls as part of their ion channel screening programmes. We plotted histograms of the IC50 values and Hill coefficients from the concentration – effect curves they had fitted, and found these distributions:


Figure 2: Histograms of pIC50s and Hill coefficients fitted to over 12,000 repeats of a screen examining whether Cisapride inhibits the hERG potassium channel current. The red lines are distributions fitted to these distributions: logistic for pIC50 and log-logistic for Hill coefficients.

After some investigation, we found the standard probability distributions that both the IC50s and Hill coefficients from these experiments seemed to follow very well were Log-logistic.

Now, where do pIC50s come in? A pIC50 value is simply a transformation of an IC50, defined with a logarithm as:

pIC50 [log M] = -log10(IC50 [M]) = -log10(IC50 [μM]) + 6

At this point it is good to note that logarithms to the base 10 (also known as common logarithms) were invented by Henry Briggs who, like me, was born in Halifax!

The ‘pIC50’ is analogous to ‘pH‘ in terms of its negative log relationship with Hydrogen ion concentration, which is more familiar to most of us.

The logarithm means that pIC50s end up being distributed Logistically, as:

ln(X) ~ Loglogistic(α, β).                       ...(1)

implies that

X ~ Logistic(μ, σ),                              ...(2)

as shown below:

Figure 3: Logistic pIC50 and Loglogistic IC50s

Figure 3: Logistic pIC50 and Log-logistic IC50s. Histogram of samples from the same simulated distribution pIC50 ~ Logistic(6, 0.2), with red lines indicating fitted Logistic and Log-logistic distributions (see note on implementation below!).

You can see how nicely these distributions work across a lot of different ion currents and drug compounds in the Elkins et al. paper supplement.

More recently, we ran into an interesting question: “Given just a handful of IC50 values  should we take the mean value as ‘representative’, or the mean value once they are converted to pIC50s”?

Well, ideally, you’d try and do rigorous inference on the parameters of the underlying distribution – as we did in the original paper – and as our ApPredict project will try to do that for you if you install that. To do that across multiple experiments you probably want to use mixed-effect or hierarchical inference models. But it’s fair enough to want a rough and ready answer in some cases, especially if you don’t know much about the spread of your data. A tool to try and infer the spread parameter σ from a decent number of repeated runs of the same experiment is something I’ll try to provide soon.

But, let’s just say you want to have this ‘representative’ effect of a drug, given a handful of dose-response curves. You’ve got a few options: you could take a load of IC50 values, and take the mean of those; or you could take a load of pIC50 values, and take the mean of those. Or perhaps the median values? But which distribution should you use? Which would be more representative, and what is the behaviour you’re looking for? Answering this was a bit more interesting and involved than I expected…

First off, let’s look at some of the properties of the distributions shown in Figure 3. Here are the theoretical properties of both distributions (N.B. there are analytic formulae for these entries, which you can get off the wikipedia pages for each distribution (Loglogistic, Logistic). I converted the answers back to pIC50, for easy comparison, but the IC50s were really taken from the right hand distribution in Figure 3 in IC50 units.).

Theoretical Results Mean Median Mode
From IC50 distbn 5.836 6 (μ) 6.199
From pIC50 distbn 6 (μ) 6 (μ) 6 (μ)

So as you might expect (from looking at it), there’s certainly a skew introduced into the Loglogistic distribution on the right of Figure 3 for the IC50 values.

So what to measure depends what kind of ‘representative’ behaviour we are after. Or in other words, what is a drug really doing when we get these distributions for observations of its action? Well, you really want to be inferring the ‘centering’ distribution parameter (μ), which in this case would be the mean/median/mode pIC50 or the median IC50. You will already get the impression that it’s most useful to think in pIC50s – as the distribution is symmetric the mean, median and mode are all the same.

But what about the more realistic case of the properties of a handful of samples of those distributions? I just simulated that and show the results in Figure 4 (I think you could probably do this analytically on paper, given time, but I haven’t yet!).

Figure 4:distributions of (Top row) Mean, and (Bottom row) median from N=3 samples of (Left) pIC50 and (Right) IC50. Simulated by taking 3 samples from Figure 3, and repeating the process a million times to build up these histograms. Note that the mean can be taken from IC50 or pIC50 transformed variables, and these give different answers.

It seems the only estimates for which you are likely to get back an unbiased and consistent estimate is the mean/median of pIC50 values, since the distribution is symmetric. The IC50 distribution isn’t symmetric, and so taking the mean of IC50 samples leads to a bias, it does however seem to give you a good estimate for the median of the IC50 distribution (better than the median of a sample of IC50s does!) for a low N – see top right plot of Figure 4. As N increases you do eventually get a distribution whose peak is at the mean, but N needs to be quite a lot larger than your average (no pun intended) experimental N. As you might expect, the median pIC50 is not quite as good a measure as the mean for the centre of the pIC50 distribution (but that’s hard to see visually in Fig 4, it does almost as well here).

You could point out that none of these “make that much difference” to the plots above, but you will introduce a bias if you use the wrong statistic, and for the semi-realistic distributions that we’ve got as an example here, your estimate of pIC50 = 5.836 versus the true pIC50 = 6 does give a block error of almost 10% when you substitute it back into a Hill curve of slope 1 at 1μM.

Importantly, pIC50s are much nicer numbers to deal with: they are always given in the same units of log M, and you can recognise at a glance for your average pharmaceutical compound whether you’re likely to have no block (<1 ish), very weak activity (2-3ish) which is perhaps just noise, low block (4-5ish) or strong block (>6ish) – give or take the concentration that the compound is going to be present at – see Figure 1! These easy-to-grasp numbers make it much easier to spot typos than it is when you’re looking at IC50s in different units. We’ve also seen parameter fitting methods that struggle with IC50s, but are happier working in log-space with pIC50s, as searching (in some sense ‘linearly’) for a pIC50 within say 9 to 2 is easier than searching for an IC50 in 1nM to 10,000μM.

So my conclusion is that I’ll try and work with pIC50s, and if you need a quick summary statistic, use the mean of pIC50s. They seem to occur in symmetric distributions with samples that behave nicely, and therefore are generally much easier to have sensible intuition about!

Continue reading

Posted in Drug action, Safety Pharmacology, Stats and Inference | Tagged , , , , , , , , , , , , | 1 Comment

Is your model really ‘comprehensive’?

I’ve noticed more and more papers using the phrase “comprehensive model”. This phrase grates every time, and this post is about why.

I thought this might just be me getting old and grumpy, so I plotted the mentions of the phrases “comprehensive mathematical model” or “comprehensive model” in ‘Topic’ in Web of Science over the years in Figure 1. Sure enough, more papers in recent years feature models that are comprehensive*.

Figure 1: The Rise of Comprehensive Models!

Figure 1: the rise of Comprehensive Models! Mentions of “Comprehensive Models” or “Comprehensive Mathematical Models” in Web of Science 1945 – 2015.

So why don’t I think a mathematical model is ever comprehensive?

Usually by comprehensive the authors mean something like this describes most of what we’ve seen well enough to say/predict something (or even to within the observational error bounds in some physics experiments). Perhaps their model integrates information/theories that weren’t part of a single conceptual framework/mathematical model before. That’s great – but it isn’t comprehensive!

This post is really an excuse to plug and discuss the following quotation from James Black (physiologist and Nobel Prize winner for developing the first beta-blockers). He summarises what mathematical models are and aren’t, and what they are for, beautifully:

[Mathematical] models in analytical pharmacology are not meant to be descriptions, pathetic descriptions, of nature; they are designed to be accurate descriptions of our pathetic thinking about nature. They are meant to expose assumptions, define expectations and help us to devise new tests.

Sir James Black (1924 – 2010)
Nobel Prize Lecture, 1988**

A mathematical model isn’t supposed to be a comprehensive representation of a system – it’s always going to be a pathetic representation in many ways! Models make simplifying assumptions (by definition***), generally ignoring things that we think will make a smaller difference to the model’s predictions than the things that we have included (usually things that happen really fast/slow or that are really small/big).

What models do allow us to do is see exactly what we would expect to happen if the system works in the simple(ish!) way we think it does. Then that can teach us loads about whether the system does work as we thought it did, or whether something fundamental is missing from our understanding.

We’ll always be able to come back and add more detail as time goes on, we learn more, and can measure more things. So that means that a model is never finished, and never comprehensive. So I’d say let’s avoid using the word comprehensive to describe any kind of model!


A Large Confession: I thought the word comprehensive implied ‘includes everything’. Purely to back up my point I looked it up, and sure enough one of the OED’s definitions is “grasps or understands (a thing) fully” which a model never does; but another is “having the attribute of comprising or including much; of large content or scope” which might be OK! Hmmmm… given the ambiguity I still conclude we’d better avoid the word anyway! But I’ll let you make your own minds up:



* Yeah, I know I really need to divide by the total number of papers that mention models etc. but that’s harder to search sensibly – and WoS doesn’t summarise data for over 10,000 papers!
** You can watch his Nobel Prize lecture online. He’s quite a remarkable man and invented/discovered lots of classes of completely new drugs, interestingly – as you can see in the video – he used modelling a lot. His modelling would now be part of the trendy new field of Quantitative Systems Pharmacology (although I’d say we’ve been doing it for years 😉 ), and if QSP people ever have a prize for anything I think they should name it after him!
*** ‘Model’ means a simplification of reality, so it’s confusing to say they are “comprehensive” whilst there are lots of things we know about that aren’t included. I double-checked this one too, and indeed our kind of model is defined as “A simplified or idealized description or conception of a particular system, situation, or process, often in mathematical terms, that is put forward as a basis for theoretical or empirical understanding, or for calculations, predictions, etc.“.
Posted in Blog, Model Development, Silly | Tagged , , , , , | Leave a comment

Are 30 compounds enough to test our simulations?

I’ve just contributed to a new review ‘Recent developments in using mechanistic cardiac modelling for drug safety evaluation’, which discusses where we are with using mathematical models of the electrical activity of cardiac muscle cells to predict whether new drugs will cause any dangerous changes to heart rhythm. Quite a bit of the review brings up challenges that we’re going to come across as we assess whether or not the simulations are ready to help regulators decide on the safety of new pharmaceutical compounds.

How do we quantify ‘predictive power’?

As part of this review I got thinking about how reliable our assessments of simulation predictions are. We often think about evaluating predictions in terms of binary (yes/no) outcomes: e.g. do we keep developing a drug or not; do we allow a drug onto the market or not? To look at predictions of this kind of thing we often use a confusion matrix, also known as a contingency matrix/table, as shown in Figure 1.

Confusion Matrix

Figure 1: a ‘confusion matrix’, adapted from Wikipedia. For us the ‘Condition’ / Gold Standard would be the experimental measurement, and the ‘Test Outcome’ would be the simulation prediction.

In Kylie‘s study evaluating simulation predictions of whether or not compounds caused 10% or more prolongation of the rabbit wedge QT interval, we looked at a decent number of compounds, 77 for simulations based on PatchXpress automated patch clamp technology (and even more for other sources of screening data). Here’s a slightly more elaborate confusion matrix from that study:

A confusion matrix for our rabbit wedge study

Table 1: Simulation performance when evaluating predictions of whether or not a drug compound will cause over 10% prolongation of the ECG waveform in the rabbit wedge, the numbers are the numbers of validation compounds which fall into each box. Note this isn’t a standard confusion matrix, as we’ve got three categories instead of two, and I’ve got the rows and columns swapped relative to the one above(!) sorry. But you can see the calculations that give us a sensitivity, specificity and accuracy.

In Table 1 I separated out the predictions into three categories (>= 10% shortening, >= 10% prolongation, or ‘no effect’ as between the two), to allow us to evaluate predictions of shortening as well. But you just need to add up the relevant boxes as shown to evaluate predictions of prolongation. So here we do quite well, and it feels as if we’ve got enough compounds in the validation set to say something concrete.

But when we’re evaluating a new simulation’s performance, how many validation compounds do we need? Clearly more than 1 or 2, since you could easily get these right by accident, or design. But are 10 validation compounds enough? Or do we need 20?, 30?, 50?, 77 as we used here?, 100?, or even more?

How does our evaluation depend on the number of validation compounds?

In Figure 2 I’ve plotted how our estimates of the prediction sensitivity and accuracy change (and converge to a roughly steady value) as we increase the number of validation compounds that we consider. Here the estimates happen to ‘level out’ at about N=60 compounds. But by simply permuting the order in which we consider these 77 compounds, and repeating the process, we get another possible graph (two permutations shown in Figure 2).

graphs of sensitivity and accuracy converging with the number of compounds we validate against

Figure 2: sensitivity and accuracy with increasing numbers of compounds. Two random permutations of the 77 available compounds are considered.

So if we didn’t have the big picture here, and only had 30 compounds to work with (solid vertical lines in Fig. 2), we could equally well conclude that the sensitivity of the simulation prediction is 35% (which you’d consider rubbish) or 65% (pretty good)! So we have a problem with N=30 validation compounds.

Can we estimate our uncertainty in the performance of our model?

How do we know how far away the ‘real’ sensitivity and accuracy is likely to be? And by ‘real’ I mean the sensitivity and accuracy if we had an infinitely large set of validation compounds to test.

Luckily, the stats for this is well established (and Dave remembered it when we were doing Kylie’s study!). These measures are a sum of binary yes/no measurements, i.e. do we get the prediction right or not? This means each validation compound can be thought of as a sample from a binomial distribution for the sensitivity or accuracy, and the confidence interval for the underlying parameter ‘p’ (in this case the sensitivity or accuracy themselves) has a nice simple analytic expression first worked out by Wilson in 1927. So this post is certainly nothing new for a statistician! But you very rarely see these confidence intervals presented alongside assessments of sensitivity, specificity or accuracy. Perhaps this is because the confidence intervals often look quite big?!?

So I’ve plotted the Wilson’s Score Interval on top of the sensitivity and accuracy graphs for the two permutations of the compounds in Figure 3, which also appears in the new review paper.

Sensitivity and accuracy with increasing numbers of compounds. Two random permutations of the 77 available compounds are considered.

Figure 3: same as Figure 2, but this time with 95% confidence intervals overlaid, calculated using Wilson’s score interval.

Fig. 3 nicely shows how the intervals narrow with increasing numbers of compounds, and also how the 95% confidence intervals do indeed include the N=77 performance evaluations (our best guess at the ‘real’ numbers) about 95% of the time. So now when we look at just 30 compounds we still get the 35% or 65% sensitivity estimates, but we know that the ‘real’ sensitivity measure could easily be somewhere in 12-63% for the blue compounds, or 35-85% for the red ones, and indeed the N=77 estimate is at 60% – within both of these. In fact, the confidence interval applies at N=77 too, so the ‘real’ value (if we kept evaluating more and more compounds) could still be anywhere in approximately 41-77%; the accuracy estimate has narrowed more, to around 63-82%.

Note that because the sensitivity calculation, shown in Fig. 1, effectively operates with a smaller sample size than the accuracy (just the experimental prolonging compounds instead of all of them) you get wider confidence intervals. Note this means you can’t tell a-priori what number of compounds you’ll need to evaluate, unless you have a fair idea of the positive/negative rate for the Gold Standard measurements.

If you are getting appropriately sceptical, you’ll now wonder why I only showed you two permutations of the compound ordering, isn’t this as bad as only looking at two compounds? Quite right! So here are 50 random permutations overlaid for Sensitivity, Specificity, Cohen’s Kappa value, and Accuracy:

Sensitivity, Specificity, Kappa and Accuracy

Figure 4: 50 repeats of this permutation of compound ordering exercise, with black dots being the assessed predictivity index, and overlaid confidence intervals for each (click for a bigger version, sorry I couldn’t make MatLab play nicely with the sizes of the axes labels).

So, is testing 30 compounds enough?

Well, you can use as many or few compounds as you like, as long as you evaluate the confidence in your prediction scores (sensitivity, specificity, accuracy etc.)! This little exercise has persuaded me it’s vital to present confidence alongside the prediction scores themselves (something I’ve not always done myself). In terms of getting confidence intervals that are the kind of width we want for deciding whether or not predictions of QT prolongation are good enough (probably down to around 10%), this little exercise suggests we probably need to be doing validation studies on around 100 compounds.

Posted in Action Potential Models, Drug action, Ion Channel Models, Safety Pharmacology, Stats and Inference | Tagged , , , , | 7 Comments

Model Complexity

We’ve been thinking about how to parameterise/train/calibrate models, and also how to select an appropriate model to begin with. This raises all sorts of interesting questions on the details, but this post is just about setting out a big overview of how this sort of thing might work, in an ideal scenario.

We’ll go through the concepts with the help of the diagram I’ve drawn in Figure 1. We initially sketched this out for Kylie‘s thesis, she’s the one doing all the work on this for the hERG channel models at the moment, but it didn’t make the final cut there, so here it is instead.

A diagram showing how these concepts are related.

Figure 1: model complexity. The relationship between the complexity of your model (* relative to the available training data’s information content); the error you will get in calibration/training; the uncertainty in parameters (the inverse of ‘identifiability’ if you like, how ill-posed the parameter fitting problem is); and the fit that you will get when predicting validation data.

First off, if you have an overly simple model, sitting at the top of Figure 1, you probably won’t have much ambiguity in the parameter values it should take for the best fit to training data, but the fit to training data isn’t likely to be very good, and the prediction of new cases in validation is likely to be worse.

But, it is always possible to get a better fit to training data by increasing the complexity of your model (adding free parameters, extra equations etc.):

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”
John von Neumann

In statistical models/machine learning approaches this is called “overfitting”, but the same applies to adding lots of complexity (postulated reactions or ion channel states etc.) into biophysical models. For some examples of overfitting see a google image search on ‘overfitting’ to give you a good idea how it happens, and why it might lead to really bad predictions later on.

Figure 2: Overfitting. From Twitter, possibly attributable to @ResearchMark, but I couldn’t find the original source, sorry!

So you can make your model overly-complicated, but this is subtle: there’s the possibility that if you try and fit too many parameters to a dataset that isn’t “information rich” enough, you’ll also run into problems. I’ve shown an example of this in Figure 3, now we’re at the bottom of Figure 1, the model is too complex for the training data to constrain in panel A, and so gives rubbish predictions in panel B (and a whole range of predictions depending at random on our initial guess at the parameters! This is a bad thing). But if we change the training data to a more ‘information-rich’ protocol that allows us to identify the parameters better, then we can make good predictions and we are back in the ‘optimal’ setting. So “Model Complexity” really means ‘model complexity relative to a given training protocol’.

A diagram showing how fits can be perfect, but predictions don't need to be.

Figure 3: Predicting with too much uncertainty in parameters (click to see a bigger version). This is an example using synthetic data generated from the ten Tusscher IKr model, but that doesn’t really matter for now. Blue dots – data, Red lines – fits (in A and C) and predictions (in B and D). A & B are from one fitting procedure – fitting to data in A, and predicting data in B, repeated from 20 different initial guesses at parameters. You get 20 different optimal parameter sets, all of which give a great fit to training data (A), but then give 20 different predictions in validation data (B), with inset showing how bad it can be. Conversely, if we train to the data in C, 20 different guesses again, then all parameters converge to the same guess (there is low uncertainty in parameters), and all predict the same thing, the right behaviour, in D. For more info on this see Fink & Noble (2009), and Fink et al. (2011) where a version of this figure first appeared.

So that was an example of a model with too many free parameters to constrain in a certain situation. It is quite possible, and easy, to write down a model that has too many free parameters to constrain with any of your available data. Or you could even write down a whole range of models of high complexity that fit the data perfectly, but will lead to different predictions. What should you do in this case? I am still a fan of simplifying the model down to one with identifiable parameters*, and testing out the simplest model that’s a decent fit with identifiable parameters. I make that sound easy, but in general it isn’t.

So which model should you use? Well, things like Information Criteria will try and let you guess from just the training data fit and model complexity. But you’ve no guarantee that they are going to accurately predict the profile of your validation error on the right of Figure 4, although that is what they’re effectively trying to do. So it is absolutely crucial to do the last step of making some validation predictions.

Making validation predictions sounds simple enough, but opens up more questions: how do I choose which validation experiment to perform? If I choose one that is too simple/similar to the training/calibration experiment, then I could trick myself into thinking I’m doing really well (and kid myself that the green shape on the right is really the same shape as the red training triangle). Most of the literature will tell you to do something based on the ‘context of use’, that is, situations as similar as possible to where you want to use the model ‘for real’. e.g. if I want to use a model to predict cardiac effects at 1Hz, then test it out at just 1Hz. This makes perfect sense for statistical/machine learning models.

With biophysical models, I don’t think it is that simple, and as usual it probably depends on what you are making the model for. If you want to predict some behaviour in future, then yes, do the ‘context of use’ simulations to see how well you are likely to do. But you might want to know whether your biophysical model is a good description of the biophysics – perhaps the biophysics itself is what you are trying to learn about, or perhaps you can’t be sure what the ‘context of use’ is (voltage waveforms in arrhythmias?). The biophysics should hold – not only outside the training data regime – in any regime where the underlying assumptions you built in about physical processes hold. In this case, perhaps it makes sense to throw the wackiest validation protocol you can think of at the model, to see if it really does capture the underlying biophysics? So we’re trying that, and I’ll let you know how we get on!

You also have to avoid falling into the trap of using validation predictions to perform the model selection for you. Sounds like an obvious thing to do, but then you have turned your ‘validation data’ into ‘model selection training data’! This is easy to do by accident, and really if you do this you should have a third set of data – sometimes called ‘test data’ to see how well the calibrated and selected model performs. Often machine learning people do work with three datasets like this – see Chapter 7 of Hastie et al.’s Elements of Statistical Learning book for a good intro.

*As an aside, people often talk about a model being “identifiable” (i.e. “the parameters of the model are identifiable”). But we have to remember that only makes any sense as shorthand for the statement “the parameters of the model are identifiable, given this training/calibration data“. Which immediately raises the question of “What if we calibrate the parameters to the results of a different experiment?”, as the example in Figure 3 shows, which is getting me into the whole new field of optimal experimental design…

Posted in Action Potential Models, Ion Channel Models, Model Development, Stats and Inference | Tagged , , , , , , , , | Leave a comment