When you are expressing how much a drug inhibits something, it’s common to fit a Hill curve through a graph of concentration against % inhibition as shown here:
In our case this is often ‘% inhibition’ for a given ionic current, a consequence of a drug molecule binding to, and blocking, ion channels of a particular type on a cell’s membrane.
A while back we were interested in what the distribution of IC50s would be if you repeated an experiment lots and lots of times. We asked AstraZeneca and GlaxoSmithKline if they had ever done this, and it turned out both companies had hundreds, if not thousands, of repeats, because they ran positive controls as part of their ion channel screening programmes. We plotted histograms of the IC50 values and Hill coefficients from the concentration-effect curves they had fitted, and found these distributions:
After some investigation, we found that both the IC50s and the Hill coefficients from these experiments were described very well by a standard probability distribution: the Loglogistic.
Now, where do pIC50s come in? A pIC50 value is simply a transformation of an IC50, defined with a logarithm as:
pIC50 [log M] = -log10(IC50 [M]) = -log10(IC50 [μM]) + 6
The ‘pIC50’ is analogous to ‘pH’ in terms of its negative log relationship with hydrogen ion concentration, which is more familiar to most of us.
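As a quick illustration of the definition above, here is a minimal Python sketch of the conversion in both directions. The function names are my own, made up for this example, and it assumes the IC50 is given in μM.

```python
import math


def ic50_um_to_pic50(ic50_um: float) -> float:
    """Convert an IC50 in micromolar to a pIC50 in log M."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    # -log10(IC50 [M]) = -log10(IC50 [uM]) + 6
    return 6.0 - math.log10(ic50_um)


def pic50_to_ic50_um(pic50: float) -> float:
    """Convert a pIC50 in log M back to an IC50 in micromolar."""
    return 10.0 ** (6.0 - pic50)


if __name__ == "__main__":
    for ic50_um in (0.001, 1.0, 10000.0):
        print(f"IC50 = {ic50_um} uM  ->  pIC50 = {ic50_um_to_pic50(ic50_um):.3f}")
```

So 1 nM, 1 μM and 10,000 μM map to pIC50s of 9, 6 and 2 respectively.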
The logarithm means that pIC50s end up being distributed Logistically, because if
X ~ Loglogistic(α, β), ...(1)
then
ln(X) ~ Logistic(μ, σ), ...(2)
as shown below:
You can see how nicely these distributions work across a lot of different ion currents and drug compounds in the Elkins et al. paper supplement.
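If you want to convince yourself of the Loglogistic/Logistic relationship numerically, here is a rough sketch using inverse-CDF sampling with only the Python standard library. The helper names and the example values of α and β are assumptions for illustration (they are not fitted to any data), and it uses the common parameterisation α = exp(μ), β = 1/σ.

```python
import math
import random


def uniform_open(rng: random.Random) -> float:
    """Draw u in (0, 1), avoiding u == 0 so that log(u / (1 - u)) is defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def logistic_from_u(u: float, mu: float, sigma: float) -> float:
    """Inverse CDF of Logistic(mu, sigma)."""
    return mu + sigma * math.log(u / (1.0 - u))


def loglogistic_from_u(u: float, alpha: float, beta: float) -> float:
    """Inverse CDF of Loglogistic(alpha, beta) (scale alpha, shape beta)."""
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)


rng = random.Random(0)
alpha, beta = 1.5, 2.0                  # illustrative values only
mu, sigma = math.log(alpha), 1.0 / beta  # matching Logistic parameters

us = [uniform_open(rng) for _ in range(1000)]
# For the same uniform draw, ln(Loglogistic sample) equals the Logistic sample.
max_gap = max(
    abs(math.log(loglogistic_from_u(u, alpha, beta)) - logistic_from_u(u, mu, sigma))
    for u in us
)
print(f"largest difference: {max_gap:.2e}")
```

The difference is just floating point rounding, which is the numerical version of saying that the log of a Loglogistic variable is Logistic.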
More recently, we ran into an interesting question: “Given just a handful of IC50 values, should we take the mean value as ‘representative’, or the mean value once they are converted to pIC50s?”
Well, ideally, you’d try to do rigorous inference on the parameters of the underlying distribution, as we did in the original paper, and as our ApPredict project will try to do for you if you install it. To do that across multiple experiments you probably want to use mixed-effect or hierarchical inference models. But it’s fair enough to want a rough and ready answer in some cases, especially if you don’t know much about the spread of your data. A tool to try and infer the spread parameter σ from a decent number of repeated runs of the same experiment is something I’ll try to provide soon.
But, let’s just say you want to have this ‘representative’ effect of a drug, given a handful of dose-response curves. You’ve got a few options: you could take a load of IC50 values, and take the mean of those; or you could take a load of pIC50 values, and take the mean of those. Or perhaps the median values? But which distribution should you use? Which would be more representative, and what is the behaviour you’re looking for? Answering this was a bit more interesting and involved than I expected…
First off, let’s look at some of the properties of the distributions shown in Figure 3. Here are the theoretical properties of both distributions. (N.B. there are analytic formulae for these entries, which you can get from the Wikipedia pages for each distribution: Loglogistic and Logistic. I converted the answers back to pIC50 for easy comparison, but the IC50 entries were really calculated from the right-hand distribution in Figure 3, in IC50 units.)
| | Mean | Median | Mode |
|---|---|---|---|
| From IC50 distribution | 5.836 | 6 (μ) | 6.199 |
| From pIC50 distribution | 6 (μ) | 6 (μ) | 6 (μ) |
So as you might expect (from looking at it), there’s certainly a skew introduced into the Loglogistic distribution on the right of Figure 3 for the IC50 values.
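For anyone who wants to reproduce the theoretical entries, here is a hedged Python sketch using the standard analytic formulae for the Loglogistic mean, median and mode. The spread σ = 0.2 (in pIC50 units) is my assumption for this example: it isn’t stated in the text above, but it does give back the tabulated 5.836 and 6.199 when μ = 6. The function name is also just illustrative.

```python
import math

LN10 = math.log(10.0)


def ic50_summaries_as_pic50(mu_pic50: float, sigma_pic50: float):
    """Mean, median and mode of the IC50 distribution, reported as pIC50s.

    Assumes pIC50 ~ Logistic(mu_pic50, sigma_pic50), so that IC50 [uM] is
    Loglogistic with scale alpha and shape beta as computed below.
    """
    alpha = 10.0 ** (6.0 - mu_pic50)      # median IC50 in uM
    beta = 1.0 / (sigma_pic50 * LN10)     # shape parameter
    if beta <= 1.0:
        raise ValueError("the mean and mode formulae used here need beta > 1")

    mean = alpha * (math.pi / beta) / math.sin(math.pi / beta)
    median = alpha
    mode = alpha * ((beta - 1.0) / (beta + 1.0)) ** (1.0 / beta)

    def to_pic50(ic50_um: float) -> float:
        return 6.0 - math.log10(ic50_um)

    return to_pic50(mean), to_pic50(median), to_pic50(mode)


if __name__ == "__main__":
    mean_p, median_p, mode_p = ic50_summaries_as_pic50(6.0, 0.2)  # sigma is assumed
    print(f"mean {mean_p:.3f}, median {median_p:.3f}, mode {mode_p:.3f}")
    # prints: mean 5.836, median 6.000, mode 6.199
```

Note the ordering flips when you convert: the mean IC50 is the largest of the three in concentration units, so it becomes the smallest pIC50.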
So what to measure depends on what kind of ‘representative’ behaviour we are after. Or in other words, what is a drug really doing when we get these distributions for observations of its action? Well, you really want to be inferring the ‘centering’ distribution parameter (μ), which in this case would be the mean/median/mode pIC50 or the median IC50. You will already get the impression that it’s most useful to think in pIC50s: as the distribution is symmetric, the mean, median and mode are all the same.
But what about the more realistic case of the properties of a handful of samples of those distributions? I just simulated that and show the results in Figure 4 (I think you could probably do this analytically on paper, given time, but I haven’t yet!).
It seems the only estimates that are likely to give you back an unbiased and consistent answer are the mean/median of pIC50 values, since the distribution is symmetric. The IC50 distribution isn’t symmetric, and so taking the mean of IC50 samples leads to a bias. It does, however, seem to give you a good estimate for the median of the IC50 distribution (better than the median of a sample of IC50s does!) for a low N – see top right plot of Figure 4. As N increases you do eventually get a distribution whose peak is at the mean, but N needs to be quite a lot larger than your average (no pun intended) experimental N. As you might expect, the median pIC50 is not quite as good a measure as the mean for the centre of the pIC50 distribution (but that’s hard to see visually in Figure 4, where it does almost as well).
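To get a feel for this kind of experiment yourself, here is a small Python sketch of a similar simulation. It is not the script behind Figure 4: the choices of N = 5 samples per ‘experiment’, 20,000 repeats, μ = 6 and σ = 0.2 are all assumptions for illustration, and it only reports the average of each estimator rather than its full distribution.

```python
import math
import random
import statistics


def draw_logistic(mu: float, sigma: float, rng: random.Random) -> float:
    """One draw from Logistic(mu, sigma) by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu + sigma * math.log(u / (1.0 - u))


def simulate(n_per_experiment=5, n_repeats=20000, mu=6.0, sigma=0.2, seed=1):
    """Average value of three summary statistics over many small experiments."""
    rng = random.Random(seed)
    mean_p, median_p, p_of_mean_ic50 = [], [], []
    for _ in range(n_repeats):
        pic50s = [draw_logistic(mu, sigma, rng) for _ in range(n_per_experiment)]
        ic50s_um = [10.0 ** (6.0 - p) for p in pic50s]
        mean_p.append(statistics.fmean(pic50s))
        median_p.append(statistics.median(pic50s))
        # Average in IC50 units first, then convert that single number to a pIC50.
        p_of_mean_ic50.append(6.0 - math.log10(statistics.fmean(ic50s_um)))
    return {
        "mean_pic50": statistics.fmean(mean_p),
        "median_pic50": statistics.fmean(median_p),
        "pic50_of_mean_ic50": statistics.fmean(p_of_mean_ic50),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the mean and median of the pIC50s should both average out very close to 6, while the pIC50 obtained from the mean IC50 sits lower. That last part is guaranteed for every individual experiment, since the log of an average is never smaller than the average of the logs.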
You could point out that none of these “make that much difference” to the plots above, but you will introduce a bias if you use the wrong statistic. For the semi-realistic distributions that we’ve got as an example here, using an estimate of pIC50 = 5.836 instead of the true pIC50 = 6 gives a block error of almost 10% when you substitute it back into a Hill curve of slope 1 at 1 μM (about 41% block rather than 50%).
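The arithmetic behind that block error can be checked with a few lines of Python. This is a sketch with a made-up function name, assuming the usual Hill curve form with the IC50 recovered from the pIC50 in μM.

```python
def hill_block_percent(conc_um: float, pic50: float, hill: float = 1.0) -> float:
    """Percentage block from a Hill curve at a given concentration (uM)."""
    if conc_um < 0:
        raise ValueError("concentration must be non-negative")
    if conc_um == 0:
        return 0.0  # no drug, no block
    ic50_um = 10.0 ** (6.0 - pic50)
    return 100.0 / (1.0 + (ic50_um / conc_um) ** hill)


if __name__ == "__main__":
    true_block = hill_block_percent(1.0, 6.0)      # true pIC50
    biased_block = hill_block_percent(1.0, 5.836)  # pIC50 from the mean IC50
    print(f"true: {true_block:.1f}%, biased: {biased_block:.1f}%, "
          f"error: {true_block - biased_block:.1f} percentage points")
```

The error is largest near the IC50 itself, where the Hill curve is steepest, and shrinks towards either end of the curve.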
Importantly, pIC50s are much nicer numbers to deal with: they are always given in the same units of log M, and you can recognise at a glance for your average pharmaceutical compound whether you’re likely to have no block (<1 ish), very weak activity (2-3ish) which is perhaps just noise, low block (4-5ish) or strong block (>6ish) – give or take the concentration that the compound is going to be present at – see Figure 1! These easy-to-grasp numbers make it much easier to spot typos than it is when you’re looking at IC50s in different units. We’ve also seen parameter fitting methods that struggle with IC50s, but are happier working in log-space with pIC50s, as searching (in some sense ‘linearly’) for a pIC50 within say 9 to 2 is easier than searching for an IC50 in 1nM to 10,000μM.
So my conclusion is that I’ll try and work with pIC50s, and if you need a quick summary statistic, use the mean of pIC50s. They seem to occur in symmetric distributions with samples that behave nicely, and therefore are generally much easier to have sensible intuition about!
A note on implementation
Now Matlab defines the arguments of the Loglogistic distribution in (1) to be exactly the same μ and σ as in (2), but another common parameterisation is α = exp(μ) and β = 1/σ.
Since our pIC50s are in log10 we need to use the following identity:
log_b(a) = log_d(a) / log_d(b), so log10(a) = ln(a) / ln(10)
Note that if we follow the Matlab way of doing things, the following relationships are useful:
IC50 [μM] ~ Loglogistic(μ, σ) implies that pIC50 [log M] ~ Logistic(6 - μ/ln(10), σ/ln(10));
or the other way,
pIC50 [log M] ~ Logistic(μ, σ) implies that IC50 [μM] ~ Loglogistic(ln(10)*(6 - μ), ln(10)*σ).
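These two relationships are easy to get the wrong way round, so here is a minimal Python sketch of both conversions, following the Matlab-style (μ, σ) parameterisation with IC50s in μM. The function names are invented for this example.

```python
import math

LN10 = math.log(10.0)


def loglogistic_ic50_to_logistic_pic50(mu: float, sigma: float):
    """IC50 [uM] ~ Loglogistic(mu, sigma)  ->  parameters of the pIC50 Logistic."""
    return 6.0 - mu / LN10, sigma / LN10


def logistic_pic50_to_loglogistic_ic50(mu: float, sigma: float):
    """pIC50 [log M] ~ Logistic(mu, sigma)  ->  parameters of the IC50 [uM] Loglogistic."""
    return LN10 * (6.0 - mu), LN10 * sigma


if __name__ == "__main__":
    # Illustrative values only: a pIC50 distribution centred on 6 with spread 0.2.
    mu_ic50, sigma_ic50 = logistic_pic50_to_loglogistic_ic50(6.0, 0.2)
    print(f"IC50 parameters: mu = {mu_ic50:.4f}, sigma = {sigma_ic50:.4f}")
    # The alternative parameterisation mentioned above:
    alpha, beta = math.exp(mu_ic50), 1.0 / sigma_ic50
    print(f"alpha = {alpha:.4f} uM, beta = {beta:.4f}")
```

Converting there and back again should return the parameters you started with, which makes a handy sanity check.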
Note that in Matlab you type ‘log’ for ln and ‘log10’ for log10. I’ve uploaded the Matlab script I used for this post here.