(Subtitle: “when you’ve got a lot of data points!”)
This post is for people interested in doing optimisation or inference with Differential Equation (DE) models.
If you are a statistician, you might be used to treating model simulators as black boxes where you can stick parameters in and get outputs out. This post is about why you need to be a bit careful with that. It examines one of the quirks of working with differential equations and optimisation/inference that my team have bumped into in a few distinct situations – including simulators given out for public optimisation competitions! I haven’t seen it referred to in any of the textbooks, but please let me know in the comments if you have.
Below in Figure 1 is a likelihood surface (or objective function) that we came across (more on the definition of it below), as a function of one of the parameters in a cardiac action potential model. We are trying to find the maximum in this case.
Not all optimisers rely on a nice smooth gradient – but they do all enjoy them! This is a horrible surface and no matter what kind of optimiser you use it is going to struggle to move around and explore something that looks like this. The red line marks the data-generating value in this case, and the green is somewhere we got stuck. Remember this is only in one dimension, now imagine it in ten or more…
To make matters worse, we might want to run MCMC on this surface to get a posterior distribution for the parameter on the x-axis. We see that there are ‘spikes’ of about 40 log-likelihood units. What does that mean? Well, if we are talking about the probability of a Metropolis-Hastings step accepting a move from a spike down into a trough in Figure 1, that equates to an acceptance ratio of exp(-40) = 4×10^-18! Our chains will certainly get stuck and never move across this space nicely.
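To put a number on that, here is a minimal sketch of the Metropolis-Hastings acceptance probability for a symmetric proposal (the function name is just for illustration):

```python
import math

def mh_acceptance(delta_loglik):
    """Probability of accepting a symmetric-proposal Metropolis-Hastings
    move that changes the log-likelihood by delta_loglik."""
    return min(1.0, math.exp(delta_loglik))

# Dropping 40 log-likelihood units off a spike is essentially never accepted:
print(mh_acceptance(-40.0))   # ~4.25e-18
print(mh_acceptance(40.0))    # 1.0 (uphill moves are always accepted)
```

So a chain that lands on a spike has effectively no chance of stepping back down, which is why it gets stuck.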
Is the problem really so non-linear that it has thousands of local minima (or modes in a posterior), as this suggests? Thankfully, the answer is ‘No’!
After a bit of detective work we figured out that this bumpy surface is entirely due to numerical error in our simulation, and it should be completely smooth! The example is from an Ordinary Differential Equation (ODE) solver but Partial Differential Equation (PDE) solvers will also give the same behaviour.
Most of the time we can’t derive exact analytic solutions to our models’ equations, so we have to use numerical solution techniques; the simplest of these is the Forward Euler method. These numerical methods give you only an approximation to the solution of your equations. You try to ensure it is accurate by spending more computational effort (taking finer time steps) and checking that the solution is converging to an answer: as you keep refining, the solution should change less and less.
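As a toy illustration of that refinement process, here is Forward Euler applied to dy/dt = -2y with y(0) = 1, whose exact solution is exp(-2t); the function and problem are of course just a stand-in for a real model:

```python
import math

def forward_euler(f, y0, t_end, n_steps):
    """Approximate y(t_end) for dy/dt = f(y) with n_steps fixed Euler steps."""
    dt = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        y = y + dt * f(y)   # one explicit Euler step
    return y

exact = math.exp(-2.0)      # true y(1) for this toy problem
for n in (10, 100, 1000):
    approx = forward_euler(lambda y: -2.0 * y, 1.0, 1.0, n)
    print(n, abs(approx - exact))   # error shrinks as the steps refine
```

Forward Euler is first-order accurate, so each tenfold refinement of the step size cuts the error by roughly a factor of ten.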
Broadly speaking we can classify the different ODE solvers into: fixed step, like the Forward Euler method, that take the same size time steps as they go along; and adaptive step that alter the length of time steps, possibly on every step. When gradients are changing fast adaptive solvers try to take lots of small steps to stay accurate; when gradients are changing more slowly they make fewer but larger steps to run computations fast.
With an adaptive time-step solver you specify target tolerances (relative to the size of the variables (RelTol), absolute (AbsTol), or typically both) and the solver refines its steps to try to maintain these tolerances on each step. In the example here we used CVODE, but another common choice is Matlab’s ode15s stiff ODE solver. The same principle applies if you use a fixed-step solver; it would need smaller time steps rather than tighter tolerances.
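For a concrete example of setting these tolerances, here is a sketch using SciPy’s adaptive `solve_ivp` as a stand-in for CVODE or ode15s, on the same toy decay equation (the model and tolerance values are illustrative only):

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    return [-2.0 * y[0]]          # toy ODE: dy/dt = -2y

# Tightening rtol/atol makes the solver take more (smaller) internal
# steps and gives a more accurate answer at the cost of more work.
for rtol, atol in [(1e-3, 1e-6), (1e-7, 1e-9)]:
    sol = solve_ivp(rhs, (0.0, 1.0), [1.0], method="LSODA",
                    rtol=rtol, atol=atol)
    err = abs(sol.y[0, -1] - np.exp(-2.0))
    print(rtol, atol, sol.t.size, err)   # tighter tolerances: more steps, less error
```

The same pattern (pass RelTol/AbsTol, inspect how the accuracy changes) carries over to CVODE and ode15s.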
In Figure 2 we show the shift in the likelihood surface as we tighten the ODE solver tolerances (Relative, Absolute in brackets above each plot):
In general RelTol = 10^-4 and AbsTol = 10^-6 are not unreasonable choices for a single ODE solve, indeed Matlab’s defaults are RelTol = 10^-3 (less precise than Figure 1) and AbsTol = 10^-6 (the same).
So why is this effect so big?
A very common assumption is that a ‘data generating process’ (the way that you end up with observations that some instrument records) is:
data = reality + observation noise on each data point
Another common assumption is that this noise is Gaussian, independent on each data point, and identically distributed (drawn from a Normal distribution with the same mean, often zero, and the same standard deviation); this is known as “i.i.d.” Gaussian noise.
A third assumption is that ‘reality’ in our equation above is given by the smooth noise-less model output. This is obviously a bit shaky (because no model is perfect), but the idea is you can still get useful information on the parameters within your model if it is close enough (N.B. bear in mind you might get overconfident in the wrong answer – this is a good paper explaining why). So we then commonly have:
data = model output + i.i.d. Gaussian noise.
We can then write down a log-likelihood (log just because it is easier to work with numerically…) and we end up with a big sum-of-squares error across all of our time trace:

$$\log L(\theta, \sigma \mid \mathbf{x}) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - f(t_i; \theta)\right)^2$$

(see the Wikipedia derivation from the Normal probability density function). Here the mean at each time point, f(t_i; θ), is the model output given some parameter set θ; the x_i are the observed data points and σ is the i.i.d. noise parameter.
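This sum-of-squares log-likelihood is straightforward to compute; a minimal sketch, where `model` and `data` are placeholder arrays standing in for your simulator output and recording at matching time points:

```python
import numpy as np

def gaussian_loglik(model, data, sigma):
    """i.i.d. Gaussian log-likelihood of observed data given a model trace."""
    residuals = np.asarray(data) - np.asarray(model)
    n = residuals.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2.0 * sigma**2))

# Sanity check: a perfect fit leaves only the normalisation term.
print(gaussian_loglik([1.0, 2.0], [1.0, 2.0], sigma=0.5))
```

Note that every data point contributes a squared residual to the sum, which is exactly why long, finely-sampled recordings make the likelihood so sensitive.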
The reason that we have come across this problem perhaps more than other people isn’t that we have been sloppier with our ODE solving (we put some effort into doing that relatively well!), but that we are dealing with problems involving high-frequency samples of time-series data. We commonly work with a few seconds of recordings sampled at 10 kHz, so we can end up with around 100,000 data points.
Why is this important? Suppose numerical error shifts your simulation so that its divergence from the data at a time point corresponds to 1.1 standard deviations of the noise rather than 1 (cumulative probabilities of roughly 0.86 and 0.84 in a statistics table). That is a tiny change at a single point, but if it happens at 100 time points the probabilities multiply: 0.86^100 = 5×10^-7 versus 0.84^100 = 3×10^-8, about a factor of ten in the overall likelihood. A numerical error with a relatively small effect on the solution at each time point has changed how plausible your parameters look by an order of magnitude. As we have more and more data points, this effect is exaggerated until even tiny shifts in the solution have huge effects on probabilities, as we saw above.
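Those table values are just the standard Normal cumulative probabilities Φ(1.1) and Φ(1.0), so the compounding is easy to reproduce; a quick sketch:

```python
import math

def phi(z):
    """Standard Normal cumulative probability P(Z < z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Per-point probabilities for residuals within 1.1 and 1.0 noise
# standard deviations (the 0.86 and 0.84 of a statistics table):
p11, p10 = phi(1.1), phi(1.0)
print(p11, p10)                   # ~0.864 and ~0.841
print(p11 ** 100)                 # ~5e-7
print(p10 ** 100)                 # ~3e-8
print(p11 ** 100 / p10 ** 100)    # an order-of-magnitude change over 100 points
```

A barely-visible per-point change becomes an order of magnitude once compounded over 100 points, and it only gets worse with 100,000.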
There’s a slight subtlety here: you might already have checked that your solution converges to within a pre-specified tolerance for a given parameter set. For example, a modeller might say “I don’t care about changes of less than 0.01% in these variables” and set the solver tolerance accordingly, and a statistician treating the simulator as a black box might just run with that. But what matters here is not the error bound on the individual variables at a given parameter set; it is the error bound that the likelihood transformation of those variables demands, in terms of reducing jumps in likelihood as a function of the parameters. So the modeller and statistician need to talk to each other to work out whether there might be problems…
I wouldn’t be surprised to find that this is one of the reasons people have found the need to use things like genetic algorithms in cardiac problems. But I suspect the information content, un-identifiability and parameter scalings are also very important factors in that.
So what should you do?
Examine 1D likelihood slices. We can fix all parameters and vary one at a time, plotting the likelihood as above. Then tighten your solver tolerances until 1D slices of your likelihood are smooth enough for optimisers/MCMC to navigate easily. Whatever this extra accuracy costs in additional solver time will be repaid by far more efficient optimisation/inference: in the examples we have looked at, the worst cost was approximately just 10% more solve time for a solve with 10x tighter tolerances, resulting in a speed up of thousands of times in optimisation.
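A minimal sketch of this check on a toy exponential-decay model (the model, synthetic data, and `loglik` helper are all illustrative, not our cardiac setup): sweep one parameter with everything else fixed, and compare the slice at loose versus tight solver tolerances.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Synthetic data from dy/dt = -k*y with k = 2 plus i.i.d. Gaussian noise.
rng = np.random.default_rng(1)
t_obs = np.linspace(0.1, 2.0, 50)
sigma = 0.02
data = np.exp(-2.0 * t_obs) + rng.normal(0.0, sigma, t_obs.size)

def loglik(k, rtol, atol):
    """Gaussian log-likelihood of the data for decay rate k, at given tolerances."""
    sol = solve_ivp(lambda t, y: [-k * y[0]], (0.0, 2.0), [1.0],
                    t_eval=t_obs, rtol=rtol, atol=atol)
    r = data - sol.y[0]
    return (-0.5 * r.size * np.log(2.0 * np.pi * sigma**2)
            - np.sum(r**2) / (2.0 * sigma**2))

# 1D slice through the decay-rate parameter at loose and tight tolerances.
for k in np.linspace(1.8, 2.2, 9):
    print(k, loglik(k, 1e-4, 1e-6), loglik(k, 1e-10, 1e-12))
```

With a stiff, realistic model the loose-tolerance column shows the spiky behaviour of Figure 1, and the prescription is simply to keep tightening until the slice is smooth.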
What about thinning the data? One way to get rid of this problem would be to remove a lot of the data points, something called ‘thinning’ in the MCMC literature (although there it usually refers to the MCMC chain afterwards rather than the data). I’m not a fan of doing it to the data: it artificially throws away information and makes your posteriors wider than they should be according to your noise model. You might not completely trust your likelihood/noise model, but thinning doesn’t automatically fix that either!
Finally, this post wouldn’t be complete without mentioning that there is a relatively new way to consider this effect, which explicitly admits that we have error from the solver, and treats it as a random variable (which can be correlated through time):
data = model + numerical approximation error + observation noise.
Dealing with this formulation is the field known as probabilistic numerics (see its homepage), and you can use it to make MCMC take account of numerical errors. In our case, I expect this approach could help by effectively warming up the likelihood (c.f. tempering methods), making the spikes relatively smaller and easier to jump over. Interestingly, in the above plots you can see that this error is not independent noise as you move through parameter space; I don’t know enough about the subject to say whether that has been handled or not! I’m not convinced it is always worth the extra complication: for big PDE models it may well be, but for the reasonably lightweight ODEs involved in single-cell cardiac work it is probably just worth solving more accurately all the time.