*(Subtitle: “when you’ve got a lot of data points!”)*

This post is for people interested in doing optimisation or inference with Differential Equation (DE) models.

If you are a statistician, you might be used to treating model simulators as black boxes where you can stick parameters in and get outputs out. This post is about why you need to be a bit careful with that. It examines one of the quirks of working with differential equations and optimisation/inference that my team have bumped into in a few distinct situations – including simulators given out for public optimisation competitions! I haven’t seen it referred to in any of the textbooks, but please let me know in the comments if you have.

Below in Figure 1 is a likelihood surface (or objective function) that we came across (more on the definition of it below), as a function of one of the parameters in a cardiac action potential model. We are trying to find the maximum in this case.

Not all optimisers rely on a nice smooth gradient – but they do all enjoy them! This is a horrible surface and no matter what kind of optimiser you use it is going to struggle to move around and explore something that looks like this. The red line marks the data-generating value in this case, and the green is somewhere we got stuck. Remember this is only in one dimension, now imagine it in ten or more…

To make matters worse, we might want to run MCMC on this surface to get a posterior distribution for the parameter on the x-axis. We see that there are ‘spikes’ of about 40 log-likelihood units. What does that mean? Well if we are talking about the probability of accepting a trough from a spike in Figure 1 using an MCMC Metropolis-Hastings step, that equates to an acceptance ratio of exp(-40) = 4×10^-18 ! Our chains will certainly get stuck and never move across this space nicely.
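As a quick check of that number, here is a minimal sketch of the Metropolis acceptance probability, assuming a symmetric proposal and a flat prior so that only the log-likelihood difference matters. The function name is just for illustration.

```python
import math

def metropolis_accept_prob(loglik_current, loglik_proposed):
    """Acceptance probability, assuming a symmetric proposal and a flat prior."""
    return math.exp(min(0.0, loglik_proposed - loglik_current))

# Stepping from the top of a spike into a trough 40 log-likelihood units lower
p_down = metropolis_accept_prob(0.0, -40.0)
# Stepping the other way, from the trough up onto the spike
p_up = metropolis_accept_prob(-40.0, 0.0)
print(f"down: {p_down:.1e}, up: {p_up}")
```

Moves uphill are always accepted, so a chain climbs onto a spike easily and then almost never leaves it.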

Is the problem really so non-linear that it has got thousands of local minima, or modes in a posterior, as this suggests? Thankfully, the answer is ‘No’!

After a bit of detective work we figured out that this bumpy surface is entirely due to numerical error in our simulation, and it should be completely smooth! The example is from an Ordinary Differential Equation (ODE) solver but Partial Differential Equation (PDE) solvers will also give the same behaviour.

Most of the time we can’t derive exact analytic solutions to our models’ equations, so we have to use numerical solution techniques; the simplest of these is the Forward Euler method. These numerical methods give you only an approximation to the solution of your equations. You try to ensure it is accurate by spending more computational effort on refining the approximation (taking finer time steps) and checking that the solution is converging to an answer. As you keep refining, the solution should change less and less.
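To make the idea of refinement concrete, here is a small sketch of Forward Euler on a toy decay equation (dy/dt = -2y, chosen purely for illustration because it has a known exact solution, and nothing to do with the cardiac model). Each time the number of steps doubles, the error against the exact answer should roughly halve.

```python
import math

def forward_euler(f, y0, t_end, n_steps):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end using n_steps equal steps."""
    dt = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += dt * f(t, y)
        t += dt
    return y

def decay(t, y):
    # Toy model (an assumption for this example): exact solution is exp(-2t)
    return -2.0 * y

exact = math.exp(-2.0)
step_counts = (10, 20, 40, 80)
errors = [abs(forward_euler(decay, 1.0, 1.0, n) - exact) for n in step_counts]
for n, err in zip(step_counts, errors):
    print(f"{n:3d} steps: error = {err:.2e}")
```

The error never reaches zero; it just shrinks as you spend more effort, which is the crux of what follows.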

Broadly speaking we can classify the different ODE solvers into: **fixed step**, like the Forward Euler method, that take the same size time steps as they go along; and **adaptive step**, that alter the length of time steps, possibly on every step. When gradients are changing fast adaptive solvers try to take lots of small steps to stay accurate; when gradients are changing more slowly they make fewer but larger steps to run computations fast.

With an adaptive time-step solver you give a target tolerance (relative to the size of the variables (RelTol), or absolute (AbsTol), or typically both) and it refines the steps to try to maintain these tolerances on each step. In the example here we used CVODE but another common one is the Matlab ode15s stiff ODE solver. The same principle would also apply if you use a fixed-step solver, it would need smaller time steps rather than tighter tolerances.
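Here is a toy sketch of how a tolerance-driven solver behaves. It is not CVODE's or ode15s's algorithm: it is a deliberately simple scheme (an Euler step compared against a Heun step, for a scalar ODE) and the function name and the decay test problem are made up for illustration. The point is only that tightening `rtol` and `atol` buys a smaller error at the cost of more steps.

```python
import math

def adaptive_heun(f, y0, t_end, rtol, atol, h0=0.1):
    """Integrate a scalar ODE dy/dt = f(t, y) from 0 to t_end.

    Each trial step compares a Forward Euler step with a Heun (2nd order)
    step and is accepted only if they differ by less than atol + rtol * |y|.
    Returns (y at t_end, number of accepted steps).
    """
    t, y, h, n_accepted = 0.0, y0, h0, 0
    while t_end - t > 1e-12:
        h = min(h, t_end - t)              # do not step past the end
        k1 = f(t, y)
        k2 = f(t + h, y + h * k1)
        y_low = y + h * k1                 # Forward Euler estimate
        y_high = y + 0.5 * h * (k1 + k2)   # Heun estimate
        err = abs(y_high - y_low)
        tol = atol + rtol * abs(y_high)
        if err <= tol:
            t, y, n_accepted = t + h, y_high, n_accepted + 1
        # Grow or shrink the next step, with a safety factor and limits
        scale = 0.9 * math.sqrt(tol / err) if err > 0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return y, n_accepted

def decay(t, y):
    return -2.0 * y                        # toy model, exact solution exp(-2t)

exact = math.exp(-2.0)
for rtol, atol in [(1e-2, 1e-6), (1e-4, 1e-6), (1e-6, 1e-8)]:
    y_end, n_steps = adaptive_heun(decay, 1.0, 1.0, rtol, atol)
    print(f"rtol={rtol:g} atol={atol:g}: {n_steps} steps, error={abs(y_end - exact):.1e}")
```

Note that the step sequence depends on the solution, so it changes as the model parameters change. That is one plausible way for the numerical error to jump around as a function of parameters, though the sketch itself does not demonstrate it.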

In Figure 2 we show the shift in the likelihood surface as we tighten the ODE solver tolerances (Relative, Absolute in brackets above each plot):

In general RelTol = 10^-4 and AbsTol = 10^-6 are not unreasonable choices for a single ODE solve; indeed Matlab’s *defaults* are RelTol = 10^-3 (less precise than Figure 1) and AbsTol = 10^-6 (the same).

So why is this effect so big?

**Likelihoods**

A very common assumption is that a ‘data generating process’ (the way that you end up with observations that some instrument records) is:

data = reality + observation noise on each data point

Another common assumption is that the noise here is Gaussian, **independent** on each data point and **identically-distributed** (comes from a Normal distribution with the same mean (often zero) and standard deviation); this is known as “**i.i.d.**” Gaussian noise.

A third assumption is that ‘reality’ in our equation above is given by the smooth noise-less model output. This is obviously a bit shaky (because no model is perfect), but the idea is you can still get useful information on the parameters within your model if it is close enough (N.B. bear in mind you might get overconfident in the wrong answer – this is a good paper explaining why). So we then commonly have:

data = model output + i.i.d. Gaussian noise.

We can then write down a log-likelihood (log just because it is easier to work with numerically…) and we end up with a big sum-of-square errors across all of our time trace:

$$\log L = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu_i\right)^2$$

(see the Wikipedia derivation from the Normal probability density function). Here we take the mean $\mu_i$ to be the model output at data point $i$ given some parameter set; $x_i$ to be the observed data points; $\sigma$ is the i.i.d. noise parameter; and $n$ is the number of data points.
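For reference, the standard i.i.d. Gaussian log-likelihood can be sketched in a few lines. The function and argument names are illustrative choices, and plain lists are used so that nothing beyond the standard library is needed.

```python
import math

def gaussian_log_likelihood(data, model_output, sigma):
    """Log-likelihood of the data under i.i.d. Gaussian noise of s.d. sigma,
    taking the model output as the mean at each data point."""
    n = len(data)
    sum_sq = sum((x - m) ** 2 for x, m in zip(data, model_output))
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - sum_sq / (2.0 * sigma ** 2)

# Two data points sitting 1 and 2 noise standard deviations from the model
print(gaussian_log_likelihood([1.0, 2.0], [0.0, 0.0], sigma=1.0))
```

Every data point contributes its own squared-error term, so any systematic shift in the model output is counted once per data point.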

The reason that we have come across this problem perhaps more than other people isn’t that we have been more sloppy with our ODE solving (we put some effort into doing that relatively well!), but that we are dealing with problems that consist of high frequency samples of time-series data. We commonly work with a few seconds of 10kHz time sampled recordings, so we can end up with around 100,000 data points.

Why is this important? Say your simulation and data diverge by >=1.1 standard deviations of the noise level (P<0.8643 in a statistics table) instead of >= 1 standard deviation (P<0.8413) because of numerical error. If this happens at 100 time points then your probabilities multiply and become 0.8643^100 ≈ 5×10^-7 and 0.8413^100 ≈ 3×10^-8. The probability assigned to your parameters having given rise to the data has changed by more than a factor of ten, because of numerical error that had a relatively small effect on the solution at each time point. As we have more and more data points, this effect is exaggerated until even tiny shifts in the solution have huge effects on probabilities, as we saw above.
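The same effect can be sketched directly in terms of the Gaussian log-likelihood rather than table probabilities. This is a variant of the argument above, and it makes the deliberately crude assumption that numerical error shifts every residual by the same amount, measured in units of the noise standard deviation.

```python
def loglik_change(n_points, resid_exact, resid_numerical):
    """Change in Gaussian log-likelihood when every residual (in units of
    sigma) moves from resid_exact to resid_numerical.

    Assumes the same shift at all n_points, which is a simplification."""
    return -0.5 * n_points * (resid_numerical ** 2 - resid_exact ** 2)

for n_points, shifted in [(100, 1.1), (100_000, 1.1), (100_000, 1.0004)]:
    change = loglik_change(n_points, 1.0, shifted)
    print(f"n = {n_points:6d}, residual 1.0 -> {shifted}: {change:.1f} log units")
```

Under these assumptions, with 100,000 data points a shift of only 0.0004 standard deviations per point is already worth about 40 log-likelihood units, which is the size of the spikes discussed earlier.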

There’s a slight subtlety here: you might have already checked that your solution is converging to within a pre-specified tolerance *for a given parameter set*. For example, a modeller might say “I don’t care about changes of less than 0.01% in these variables, so I set the solver tolerance accordingly”, and then a statistician treating the simulator as a black box might just run with that. But what is important here is not the error bound on the individual variables at a given parameter set, but the error bound that the likelihood transformation of these variables demands in terms of reducing jumps in likelihood *as a function of parameters*. So the modeller and statistician need to talk to each other here to work out whether there might be problems…

**Conclusions**

I wouldn’t be surprised to find that this is one of the reasons people have found the need to use things like genetic algorithms in cardiac problems. But I suspect the information content, un-identifiability and parameter scalings are also very important factors in that.

So what should you do?

Examine *1D likelihood slices*. We can fix all parameters but one and vary that one, plotting out the likelihood as above. Then tighten your solver tolerances until 1D slices of your likelihood are smooth enough for optimisers/MCMC to navigate easily. Whatever this extra accuracy costs in additional solver time will be compensated in far more efficient optimisation/inference (in the examples we have looked at, the worst cost is approximately just 10% more solve time for a solve with 10x tighter tolerances, resulting in thousands of times speed up in optimisation).
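A minimal sketch of that check follows, assuming you already have a `log_likelihood(params)` function that wraps your simulator (here replaced by a toy quadratic, purely as a stand-in). Counting local maxima along the slice is just one crude, hypothetical way to put a number on 'bumpy'; plotting the slice is the more informative thing to do.

```python
def likelihood_slice(log_likelihood, params, index, values):
    """Evaluate log_likelihood along one parameter, holding the others fixed."""
    slice_values = []
    for v in values:
        trial = list(params)
        trial[index] = v
        slice_values.append(log_likelihood(trial))
    return slice_values

def count_local_maxima(slice_values):
    """Number of interior points higher than both neighbours.
    A smooth, unimodal slice should give at most one."""
    triples = zip(slice_values, slice_values[1:], slice_values[2:])
    return sum(1 for a, b, c in triples if b > a and b > c)

def toy_log_likelihood(params):
    # Stand-in for a real simulator-based likelihood (an assumption)
    return -(params[0] - 1.0) ** 2 - (params[1] - 2.0) ** 2

grid = [0.1 * i for i in range(21)]          # parameter 0 from 0 to 2
smooth_slice = likelihood_slice(toy_log_likelihood, [1.0, 2.0], 0, grid)
print(count_local_maxima(smooth_slice))
```

In practice you would repeat this for each parameter and for each solver tolerance setting, and keep tightening until the count (or the plot) stops showing spurious peaks.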

What about *thinning the data*? A way to get rid of this problem would be to remove a lot of data points, something that’s called ‘thinning’ in the MCMC literature (although it usually refers to the MCMC chain afterwards rather than the data). I’m not a fan of doing it to the data. It will artificially throw away information and make your posteriors wider than they should be according to your noise model. You might not completely trust your likelihood/noise model, but thinning doesn’t automatically fix it either!

Finally, this post wouldn’t be complete without mentioning that there is a relatively new way to consider this effect, which explicitly admits that we have error from the solver, and treats it as a random variable (which can be correlated through time):

data = model + numerical approximation error + observation noise.

Dealing with this formulation is the field known as *probabilistic numerics* – see the homepage for this – and you can use it to make MCMC take account of numerical errors. In our case, I expect this approach could help by effectively warming up (cf. tempering methods) the likelihood and making the spikes relatively smaller and more jump-able. Interestingly, in the above plots you can see that this isn’t independent noise as you move through parameter space; I don’t know enough about the subject to say whether that has been handled or not! Whether it is worth the extra complication, I’m not convinced. Maybe for big PDE models it will be worth the trouble, but for the reasonably lightweight ODEs involved in single cell cardiac work it is probably just worth solving more accurately all the time.

Hi Gary,

An interesting blog post as always. I agree that the root of your problem is that your ‘noise’ model is assumed to be independent and identically distributed. I don’t think you have to go to probabilistic numerics to resolve this. Instead, you need to consider the model error (which can include numerical error). The model error will not be independent at each of your data points. Instead of

data = model output + i.i.d. noise

consider

data = model output + model error + i.i.d. noise

As the model error is not assumed to be identically distributed your likelihood will now feature a model error covariance matrix. You can estimate this matrix either before or during inference. There are various ways to do this, though they can be difficult. The key point, as you’ve already observed, is that you will see a high covariance between model errors at closely spaced data points. Incorporating the model error covariance into your likelihood calculation will naturally correct for the issue you’ve observed.

In practical usage you may find it difficult to obtain a sufficiently accurate estimate of your model error covariance matrix to just plug it into your likelihood calculation. In this case it is perfectly valid to thin your data to remove highly correlated points. The crucial difference is that the thinning is not arbitrary, or necessarily uniform, but driven by your estimate of the model error covariance. Doing this correctly you will throw away a minimal amount of information and resolve your problem.

Interesting Raf thanks – I’ve seen Gaussian Processes (through time) used to describe this sort of ‘model discrepancy’ term (Kennedy & O’Hagan style). Or are you considering a single scalar additive random variable (that describes error in the overall likelihood?) drawn from a distribution whose properties are a function of parameters?

I thought this post really interesting. It seems to me that modellers like yourself can have a much stronger intuition for how to solve these kinds of problems than statisticians.

I found this statement a bit surprising,

“Whatever this extra accuracy costs in additional solver time will be compensated in far more efficient optimisation/inference”

I am guessing this was true for the models that you are working with, but I wonder how true it is in general?

If an ODE system has chaotic dynamics, wouldn’t you expect that small changes in parameter values could lead to large changes in the likelihood? In other words, I think there may be situations where, no matter how much you tighten your tolerances, the likelihood function will still be very noisy.

Yeah, this sweeping generalisation was based on thinking about having to find the maximum on the plots in Figure 2. Trivial in the high tolerance case (relatively few objective function calls [ODE solves] required), but almost impossible in the low tolerance case (thousands of objective function calls required), so you can afford to take a bit longer to solve the ODEs on each objective function call to make the optimisation easy.

Afraid I haven’t played with chaotic systems enough to have any intuition on that at the moment. I suspect that a smart choice of objective function is very important here, and matching every time point on a dense time series probably isn’t the right way to go about it.

Some interesting comments from Chris Rackauckas who develops the Julia ODE solvers on a twitter thread above this tweet: https://twitter.com/ChrisRackauckas/status/1053018882972106752
