Jeremie Coullon

Early Monte Carlo methods - Part 2: the Metropolis sampler

2021-06-23T08:00:00+00:00

This post is the second of a two-part series on early Monte Carlo methods from the 1940s and 1950s. In my previous post I gave an overview of Monte Carlo methods in the 40s and focused on the 1949 conference in Los Angeles. In this post I’ll go over the classic paper Equation of State Calculations by Fast Computing Machines (1953) by Nick Metropolis and co-authors. I’ll give an overview of the paper and its main results and then give some context about how it was written. I’ll then delve into some details of how the MANIAC computer worked to give an idea of what it must have been like to write algorithms such as MCMC on it.

The classic Metropolis sampler paper

Overview of paper

The objective of the paper is to estimate properties (in this case pressure) of a system of interacting particles. The system of study consists of \(N\) particles (which they model as hard disks) in a 2D domain. They mention however that they are working on a 3D problem as well. The potential energy of the system is given by:

\[E = \frac{1}{2} \sum_{i=1}^N \sum_{j=1, j\neq i}^N V(d_{ij})\]

Here \(d_{ij}\) is the distance between molecules and \(V\) is the potential between molecules.

It seems that a usual thing to do up to the 1950s to estimate properties for such complicated systems was to approximate them analytically. The paper compares results from MCMC with two standard approximations. They also mention that an alternative method would be to use ordinary Monte Carlo by sampling many particle configurations uniformly and weighting them by their probability (which is determined by their energy). The problem with this approach is that the weights would all be essentially zero if many particles are involved. Their solution is to rather do the opposite: sample particle configurations based on their probabilities and then take an average (with uniform weights).

After an overview of the problem they introduce the Metropolis sampler: for each particle suggest a new position using a uniform proposal and allow this move with a certain probability. Interestingly, this sampler would be called these days a Metropolis-within-Gibbs sampler or a single-site Metropolis sampler as each particle is updated one at a time. In figure 1 we see the description of the algorithm from the paper.

Figure 1: Description of the Metropolis sampler
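To make the one-particle-at-a-time update concrete, here is a minimal Python sketch of a single cycle. It is not the code from the paper: the periodic unit box, the step size `alpha`, the disk `diameter`, and all function names are my own assumptions for illustration. The acceptance step is the standard Metropolis rule (always accept a move that lowers the energy, otherwise accept with probability \(\exp(-\Delta E / kT)\)).

```python
import math
import random

def pair_energy(positions, i, trial, potential, box=1.0):
    """Energy of particle i placed at `trial`, interacting with all others."""
    total = 0.0
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue
        dx = abs(trial[0] - xj)
        dy = abs(trial[1] - yj)
        # Minimum-image distance in a periodic square box (assumed here).
        dx = min(dx, box - dx)
        dy = min(dy, box - dy)
        total += potential(math.hypot(dx, dy))
    return total

def metropolis_cycle(positions, potential, alpha, kT, rng, box=1.0):
    """One cycle: propose a uniform move for each particle in turn."""
    accepted = 0
    for i, (x, y) in enumerate(positions):
        trial = ((x + alpha * rng.uniform(-1, 1)) % box,
                 (y + alpha * rng.uniform(-1, 1)) % box)
        delta = (pair_energy(positions, i, trial, potential, box)
                 - pair_energy(positions, i, (x, y), potential, box))
        # Accept downhill moves; accept uphill moves with prob exp(-delta/kT).
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            positions[i] = trial
            accepted += 1
    return accepted

def hard_disk(d, diameter=0.05):
    """Hard-disk potential: infinite on overlap, zero otherwise."""
    return math.inf if d < diameter else 0.0
```

With the hard-disk potential the rule reduces to "accept the move unless it creates an overlap", since the energy change is either zero or infinite.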

The algorithm was coded up to run on the MANIAC computer, and took around 3 minutes to update all 224 particles (which is obviously slow by today’s standards). Note that they used the Middle Square method to generate the uniform random numbers in the proposal distribution and the accept-reject step. This random number generator has some issues but is fast and therefore much more convenient than reading in random numbers from a table (this method was introduced in the 1949 conference and is discussed in my previous post).

They then justify the new sampler by giving an argument for how it will converge to the target distribution: they show that the system is ergodic and that detailed balance is satisfied. Finally, they run experiments and estimate the pressure of a system of particles. They compare the results to two standard analytic approximations, and find that the MCMC results agree with the approximations in the parameter region where they are known to be accurate.

Discussion

This paper explains their new sampler clearly and simply, includes some theoretical justification, and has experiments to show that the method can work in practice. One thing I’m not clear about is that they don’t have access to the “ground truth”, so it’s not completely clear how we know that the MCMC results are correct. However they do explain how the analytic approximations diverge from the MCMC results exactly in the parameter regions where those approximations are expected to break down.

Another point is that they include some discussion of the Monte Carlo error, but they seem to compute the error using the variance of samples and not correct for the correlation between samples. We now know that we must calculate the integrated autocorrelation time and use it to find the effective sample size. So a nitpick of the paper would be that their Monte Carlo error estimate is too small! We’ll have to wait until Hastings’ paper (see section 2) in 1970 for a rigorous treatment of the variance of estimates from correlated samples.
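As a rough illustration of the correction, the sketch below estimates the integrated autocorrelation time \(\tau = 1 + 2\sum_{k \geq 1} \rho_k\) and the effective sample size \(n/\tau\) of a chain. The truncation rule (stop at the first non-positive autocorrelation estimate) is just one simple heuristic that I picked for this example, and the AR(1) helper is only there to produce a correlated test chain.

```python
import random

def autocovariance(x, lag):
    """Biased sample autocovariance of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n

def integrated_autocorr_time(x, max_lag=None):
    """tau = 1 + 2 * (sum of autocorrelations), truncated at the first
    non-positive estimate."""
    n = len(x)
    max_lag = max_lag or n // 2
    c0 = autocovariance(x, 0)
    tau = 1.0
    for lag in range(1, max_lag):
        rho = autocovariance(x, lag) / c0
        if rho <= 0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(x):
    """Number of independent samples carrying the same information."""
    return len(x) / integrated_autocorr_time(x)

def ar1_chain(phi, n, rng):
    """Correlated toy chain: x_t = phi * x_{t-1} + standard normal noise."""
    chain = [0.0]
    for _ in range(n - 1):
        chain.append(phi * chain[-1] + rng.gauss(0.0, 1.0))
    return chain
```

The naive standard error divides the sample variance by \(n\); the corrected one divides by the effective sample size, which is smaller whenever the samples are positively correlated.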

Finally, they use 16 cycles as burn-in (one cycle involves updating all the particles) and then run 48 to 64 cycles to sample configurations (they run different experiments). I haven’t reproduced this experiment to see the trace plots but intuitively this seems fairly low. However they were limited by the computing power available to them and this must have been enough to get the estimates they were looking for.

An interesting article from 2004 includes the recollections from Marshall Rosenbluth (one of the authors) where he explained the contributions of each of the authors of the paper. It turns out that he and Arianna Rosenbluth (his wife) did most of the work. More specifically, he did the mathematical work, Arianna wrote the code that ran on the MANIAC, Augusta Teller wrote an earlier version of the code, and Edward Teller gave some critical suggestions about the methodology. Finally, Nick Metropolis provided the computing time; as he was first author the method is therefore named after him. But perhaps a more appropriate name would have been the Rosenbluth algorithm.

MANIAC

A lot of the early Monte Carlo methods were intimately linked with the development of modern computers, such as the ENIAC and the MANIAC 1 (which was used in the Metropolis paper). It therefore makes sense to look in some detail at how these computers worked to see what it must have been like to write code for them. We’ll look into some sections from the MANIAC’s documentation to get a taste for how this computer worked.

Historical context

The MANIAC computer was a step in a series of progressively faster and more reliable computers. Its construction started in 1946, it became operational in 1952 (just in time for the 1953 Metropolis paper), and it was shut down in 1958. It was used to work on a range of problems such as PDEs, integral equations, and stochastic processes. It was succeeded by the MANIAC II (built in 1957) and the MANIAC III (1964) which were faster and easier to use. To give some context, Fortran came out in 1957, Lisp in 1958, and Cobol in 1959. So code written for the MANIAC was not at all portable; you had to learn how to program for this specific machine.

Arithmetic

We start by looking at the introduction (often a nice place to start) of the documentation (page 6 in the pdf). We find that numbers are represented as binary digits (which was to be expected). Note that they use the word bigit to mean a binary digit; it’s perhaps a shame that this term didn’t stick. I’ll use it throughout the rest of the post as I like it.

The storage capacity of the MANIAC is:

1 sign bigit
39 numerical bigits

We then put a binary point (the base 2 analogue of a decimal point) before the first numerical bigit. The number 0.75 (in decimal) would then be represented on the MANIAC as 0.11 (binary), since \(0.75 = 1/2 + 1/4\). Note that this means that numbers can only be between \(-1\) and \(1\). So if your program will generate numbers outside this range you must either scale the numbers before doing the calculations or adjust the magnitudes of numbers on the fly.
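To get a feel for this fixed-point format, here is a small Python sketch that writes out the numerical bigits of a number between 0 and 1. It only illustrates the binary expansion and makes no attempt to model the actual MANIAC hardware; the function name and the use of exact fractions are my own choices.

```python
from fractions import Fraction

FRACTION_BIGITS = 39  # numerical bigits after the binary point

def to_bigits(value, n_bigits=FRACTION_BIGITS):
    """Binary expansion of 0 <= value < 1, truncated to n_bigits places."""
    value = Fraction(value)
    if not 0 <= value < 1:
        raise ValueError("only magnitudes below 1 fit after the binary point")
    bigits = []
    for _ in range(n_bigits):
        value *= 2                  # shift the next bigit in front of the point
        bigit = int(value >= 1)
        bigits.append(str(bigit))
        value -= bigit
    return "0." + "".join(bigits)

three_quarters = to_bigits(Fraction(3, 4), 4)   # "0.1100"
one_fifth = to_bigits(Fraction(1, 5), 8)        # "0.00110011"
```

Three quarters terminates after two bigits, whereas one fifth has a repeating expansion and would be truncated after 39 bigits.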

Negative numbers

We now consider negative numbers on the MANIAC (page 7 of the pdf). The natural thing to do to generate a negative number would be to have the sign bigit be 0 for a positive number and 1 for a negative number (or vice versa). But the MANIAC represents the negative of the number \(x\) as the complement of \(x\) with respect to \(2\), namely:

\[c = 2 - |x|\]

As \(0 < x < 1\) we have \(1 < c < 2\). This means that the sign bigit will always be 1 for negative numbers, and that all the numerical bigits of \(c\) will be the bigits of \(x\) flipped.

To illustrate this, suppose that \(x = -.101110101...011\) (in binary). This means that its representation on the MANIAC will be c = 1.010001010...101. Note that the sign bigit is 1, and that all the digits are flipped with the exception of the last one (which is always 1). You can check that this is the case by calculating the difference \(2-|x|\) in binary, namely 10.000...000 - 0.101110101...011.

This way of representing negative numbers may feel convoluted, but taking the complement of a number is easy to do on the MANIAC; you simply flip the numerical bigits and change the sign bigit.
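Here is a hedged Python sketch of the complement, treating a machine word as an integer whose top bigit is the sign. It uses the standard identity that \(2 - |x|\) can be obtained by flipping every bigit and adding one in the last place; I have not checked how the MANIAC circuitry actually carried this out, so take the function names and the 5-bigit toy word below as illustrative assumptions.

```python
def negate(word, bigits=40):
    """Complement with respect to 2: flip every bigit, add one in the last place."""
    modulus = 1 << bigits            # the word pattern "10.000...", i.e. the number 2
    return ((word ^ (modulus - 1)) + 1) % modulus

def show(word, bigits=40):
    """Write a word as <sign bigit>.<numerical bigits>."""
    bits = format(word, "0{}b".format(bigits))
    return bits[0] + "." + bits[1:]

# Toy word with 1 sign bigit and 4 numerical bigits: x = 0.1011
x = 0b01011
c = negate(x, bigits=5)              # 10.0000 - 0.1011 = 1.0101
```

In this toy example `show(x, 5)` gives `0.1011` and `show(c, 5)` gives `1.0101`, which matches the subtraction done by hand.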

Subtraction

The benefits of writing negative numbers in this weird way become even more apparent when we consider subtraction. To subtract \(b\) from \(a\) you simply add \(a\) to the complement of \(b\). This means that \(a-b\) becomes \(a + (2-b)\) (assuming that \(a,b>0\)).

Let’s check that this works for \(a,b>0\). We have to consider two cases: \(a>b\) and \(a<b\).

first case: \(a>b\)

The first case to consider is when \(a>b\). As both numbers are between \(0\) and \(1\) we have:

\[\begin{align} 0 &< a - b < 1 \\ 2 &< 2 + (a - b) < 3 \end{align}\]

We represent the number \(2 + (a-b)\) in binary as 10.<numerical bigits>... However the leftmost bigit is outside the capacity of the computer so is dropped (which subtracts \(2\) from the result), and you end up with the correct number: \(a-b\).

second case: \(a<b\)

In this case we have:

\[\begin{align} -1 &< a - b < 0 \\ 1 &< 2 + (a - b) < 2 \\ 1 &< 2 - (b - a) < 2 \end{align}\]

The term \(2- (b-a)\) is simply the complement of \(b-a\), namely “negative” \(b-a\). This is also exactly the result we wanted!

The other cases - as recommended in the documentation (see figure 2 below) - are left as exercises to the reader.

Figure 2: the documentation's recommendation

So subtraction reduces to flipping some bigits and doing addition.
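The two cases above can be checked with a short sketch that does the subtraction with 40-bigit integer words. This is my own illustration using Python integers and exact fractions, not a transcription of the MANIAC instructions; dropping the leftmost carry is done with a bit mask.

```python
from fractions import Fraction

BIGITS = 40
MODULUS = 1 << BIGITS           # plays the role of the number 2
HALF = 1 << (BIGITS - 1)        # plays the role of the number 1

def from_fraction(x):
    """Word for a number -1 < x < 1 (negative numbers stored as complements)."""
    return int(Fraction(x) * HALF) % MODULUS

def to_fraction(word):
    """Number represented by a word, reading the sign bigit as 'negative'."""
    signed = word - MODULUS if word >= HALF else word
    return Fraction(signed, HALF)

def complement(word):
    """2 - x, in word form."""
    return (MODULUS - word) % MODULUS

def subtract(a, b):
    """a - b computed as a + complement(b); the carry out of the sign bigit is dropped."""
    return (a + complement(b)) & (MODULUS - 1)

a, b = from_fraction(Fraction(3, 4)), from_fraction(Fraction(1, 4))
first_case = to_fraction(subtract(a, b))    # a > b
second_case = to_fraction(subtract(b, a))   # a < b
```

Here `first_case` is \(1/2\) and `second_case` is \(-1/2\), so the same add-and-drop-the-carry step handles both signs of the result.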

Modular arithmetic

Finally, this way of doing subtraction doesn’t feel so weird anymore if we think about modular arithmetic. For example if we are working modulo \(10\) and want to subtract \(3\) from a number, we can simply add the complement of \(3\) which is \(10-3=7\). Namely:

\[\begin{align} - 3 &\equiv 7 \hspace{2mm} (10) \\ x - 3 &\equiv x + 7 \hspace{2mm} (10) \\ \end{align}\]

This is exactly how the MANIAC does subtraction, but modulo \(2\) rather than \(10\).

Writing programs

The programmer had to be very comfortable with this way of representing numbers and be able to write complicated numerical methods such as MCMC or PDE solvers. They would describe every step of the program on a “computing sheet” before representing them on a punch card which was finally fed into the MANIAC. Figure 3 shows a sample from the documentation which describes the instructions needed to implement subtraction.

Figure 3: The computing sheet for subtraction

The documentation then goes on to describe more complicated operations, such as taking square roots, as well as the basics of MANIAC’s internal design.

In conclusion, coding was complicated and required working with engineers as the machines broke down regularly. The only built-in operators were (+), (-), (*), (%), and inequality/equalities. Figure 4 (below) is a sample from the documentation and shows how this computer was an improvement from previous versions: it was so simple that even a programmer could turn it on and off!

Figure 4: an intuitive UI (page 291)

Conclusion

The development of Monte Carlo methods took off in the 1940s, and this work was closely tied to the progress of computing. The usual applications at the time were in particle and nuclear physics, fluid dynamics, and statistical mechanics. In this post we went over the classic 1953 paper on the Metropolis sampler and went over some details of the MANIAC computer. We saw how the computers used in scientific research were powerful (for the era), but were complicated machines that required a lot of patience and detailed knowledge to use.

Thanks to Gael Martin and Christian Robert for useful feedback on this post

Early Monte Carlo methods - Part 1: the 1949 conference

2021-06-23T08:00:00+00:00

This post is the first of a two-part series on early Monte Carlo methods from the 1940s and 1950s. Although Monte Carlo methods had been previously hinted at (for example Buffon’s needle), these methods only started getting serious attention in the 1940s with the development of fast computers. In 1949 a conference on Monte Carlo was held in Los Angeles in which the participants - many of them legendary mathematicians, physicists, and statisticians - shared their work involving Monte Carlo methods. It can be enlightening to review this research today to understand what issues these researchers were grappling with. In this post I’ll give a brief overview of computing and Monte Carlo in the 1940s, and then discuss some of the talks from this 1949 conference. In the next post I’ll go over the classic 1953 paper on the Metropolis sampler and give some details of the MANIAC computer.

Historical context: Monte Carlo in the 1940s

Computing during World War 2

During WW2, researchers at Los Alamos worked on building the atomic bomb which involved solving nonlinear problems (for example in fluid dynamics and neutron transport). As the project advanced they used a series of increasingly powerful computers to do this. Firstly, they used mechanical desk calculators; these were standard calculators at the time, but were slow and unreliable. They then upgraded to using electro-mechanical business machines which were bulkier but faster. One application was to solve partial differential equations (PDEs). To solve a PDE numerically, one punch card was used to represent each point in space and time, and a set of punch cards would represent the state of the system at some point in time. You would pass each set of cards through several machines to solve the PDE forward in time. The machines would often break down and would have to be fixed which made doing any calculation laborious. Finally, John von Neumann recommended the scientists use the ENIAC computer which was able to do calculations an order of magnitude faster. You can find more details on computing at Los Alamos in this report by Francis Harlow and Nick Metropolis.

Monte Carlo in the 40s

A well known story about early Monte Carlo methods is about Stan Ulam playing solitaire while he was ill (see this article by Roger Eckhardt for more details). He wondered what the probability was of winning at solitaire if you simply laid out a random configuration. After trying to calculate this mathematically, he then considered laying out many random configurations and simply counting how many of them were successful. This is a simple example of the Monte Carlo method: using randomness to estimate a non-random quantity. Ulam - along with John von Neumann - then started thinking about how to apply this method to solve problems involving neutron diffusion. Von Neumann was important in developing new Monte Carlo methods and new computers as well as giving legitimacy to these new fields; that such a well respected scientist was interested in these things helped these fields become respectable among other scientists. At that time Monte Carlo methods mainly seemed to spread by word of mouth. For example we can see in figure 1 an extract from a letter written by von Neumann to Ulam in 1947 describing rejection sampling. You can see more of the letter in Eckhardt’s article.

Figure 1: sample of von Neumann's letter to Ulam describing rejection sampling.

The 1949 Conference on Monte Carlo

In 1949 a conference was held in Los Angeles to discuss recent research on Monte Carlo. This was the first time a lot of these methods were published: the speakers gave talks and some of these were written up in a report. The attendees of this conference included a lot of famous mathematicians, statisticians, and physicists such as von Neumann, Tukey, Householder, Wishart, Neyman, Feller, Harris, and Kac. The speakers introduced such sampling methods as the rejection sampler, the middle square method, and splitting schemes for rare event sampling (these three methods were actually suggested by von Neumann). There were also talks about applications in such areas as particle and nuclear physics as well as statistical mechanics.

We’ll now focus on two of the talks, both presenting techniques invented by von Neumann.

the Middle Square Method

Von Neumann introduced the Middle Square method, which was a way to quickly generate pseudorandom numbers. A standard way at the time of obtaining random numbers was to generate a list of uniform random numbers using some kind of physical process, then use that list of random numbers in calculations. However when using computers such as the ENIAC or MANIAC, reading this list into the computer was much too slow. It was therefore necessary to generate these random numbers on the fly, even if that meant obtaining random samples of lower quality.

The method works as follows:

Start from an integer with \(n\) digits (with \(n\) even):
Square the number
If the square has fewer than \(2n\) digits, add leading zeros (to the left of it) so that the new number has \(2n\) digits
keep the \(n\) digits in the middle of the number.

For example, if we start from the seed \(a_0 = 20\) (2 digits) the algorithm runs as follows:

\(20^2=0400\), so \(a_1=40\)
\(40^2=1600\), so \(a_2 = 60\)
Then \(a_i = 60\) for \(i>2\). Here the process has converged and will only generate the number \(60\) from now on.

So we notice that we need to use a carefully chosen large seed (namely, at least 10 digits). If we don’t choose a good seed the numbers might quickly converge and the algorithm will keep on generating the same number for ever. A lot of careful work was therefore done on finding good seeds and doing statistical tests on the generated numbers to assess their quality. One of the talks in the conference discusses some tests done on this algorithm (see talk number 12 by George Forsythe).
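The steps listed above fit in a few lines of Python. This is a minimal sketch with names of my own choosing, using decimal strings to pad the square and pick out the middle digits.

```python
def middle_square(seed, n_digits, count):
    """Generate `count` numbers with the middle square method."""
    if n_digits % 2:
        raise ValueError("n_digits must be even")
    values = []
    x = seed
    for _ in range(count):
        # Square, then pad with leading zeros up to 2n digits.
        squared = str(x * x).zfill(2 * n_digits)
        # Keep the n digits in the middle.
        start = n_digits // 2
        x = int(squared[start:start + n_digits])
        values.append(x)
    return values

stuck = middle_square(20, 2, 4)      # [40, 60, 60, 60]
longer = middle_square(1234, 4, 2)   # [5227, 3215]
```

The first call reproduces the example with seed 20, which gets stuck at 60 after two steps.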

Obviously today we have much better PRNGs, so we should use these modern methods and not the middle square process. But it is interesting to see some of the early work on PRNGs.

Rejection sampling

Von Neumann also introduced rejection sampling (though we saw in figure 1 that he had developed this method several years earlier). This talk is called “Various techniques used in connection with random digits” and a pdf of the talk (without the rest of the conference) can be found here. There also seem to be several dates attributed to this paper (see this stackexchange question about how to properly cite it). In the talk von Neumann first reviews methods used to generate uniform random numbers, then introduces the rejection sampling method and how to use it with a few examples.

Generating uniform random numbers

Von Neumann starts by considering that a physical process (such as a nuclear process) can be used to generate high quality random numbers, and that a device could be built that would generate these as needed. However he points out that it would be impossible to reproduce the results (if you need to debug the program for example) as there would be no random seed that would reproduce the same random numbers. He concludes this by saying:

“I think that the direct use of a physical supply of random digits is absolutely inacceptable for this reason and for this reason alone”.

In light of this comment it is interesting to consider how most of modern research code involving random numbers does not keep track of the random seed used, and is therefore not reproducible in this sense (namely, in terms of the realisation of the random variables used).

He then considers generating random numbers using a physical process and printing out the results (this is discussed in another talk in the conference); one can then read in the random numbers to the computer as needed. However he points out that reading in numbers to a computer is the slowest aspect of it (in more modern terms, the problem would be I/O bound).

Finally, he concludes that using arithmetic methods such as the Middle Square method is a nice practical alternative that is both fast and reproducible. However one needs to check whether a random seed will generate random samples of sufficiently high quality. It is here that he famously says:

“Any one who considers arithmetic methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number - there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method”

However he takes the very practical approach of recommending we use such arithmetic methods and simply test the generated samples to make sure they are good enough for applications.

The rejection method

In the second part of the talk he considers the problem of generating non-uniform random numbers following some distribution \(f\), namely how to sample \(X \sim f\). This requires using uniformly distributed random numbers (generated using the middle square method for example) and transforming them appropriately.

He first considers using the inverse transform but considers this to be too inefficient. He then introduces the rejection method which is as follows:

choose a scaling factor \(a \in \mathbb{R}^+\) such that \(af(x) \leq 1\) for all \(x\)
sample \(X, Y \sim \mathcal{U}(0,1)\)
Accept \(X\) if \(Y \leq af(X)\). Reject otherwise

This last step corresponds to accepting \(X\) with probability \(af(X)\) (which is between \(0\) and \(1\)).
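As a small illustration of these three steps, the sketch below samples from the density \(f(x) = 2x\) on \((0,1)\). The choice of density and the scaling factor \(a = 1/2\) (so that \(af(x) = x \leq 1\)) are my own example, not one taken from the talk.

```python
import random

def rejection_sample_unit(scaled_density, rng):
    """Draw one sample on (0, 1); scaled_density(x) = a * f(x) must be <= 1."""
    while True:
        x, y = rng.random(), rng.random()
        if y <= scaled_density(x):     # accept X with probability a * f(X)
            return x

rng = random.Random(0)
# f(x) = 2x with a = 1/2, so a * f(x) = x
samples = [rejection_sample_unit(lambda x: x, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

The mean of this target is \(2/3\), so `sample_mean` should land close to that value.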

Note that the more modern version of this method considers a general proposal distribution \(g\) (see these lecture notes or any textbook on Monte Carlo):

let \(l(x) = cf(x)\) be the un-normalised target density (\(f\) is the target, \(c\) is unknown)
Let \(M \in \mathbb{R}\) be such that \(Mg(x) \geq l(x)\) for all \(x\)
draw \(X \sim g\) and compute: \(r = \frac{l(X)}{Mg(X)}\)
Accept \(X\) with probability \(r\)

Here we sample \(X\) from a general distribution \(g\) and similarly choose a constant \(M\) such that \(r\) is between \(0\) and \(1\).
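Here is a hedged sketch of the general version. As an example of my own choosing, the target is the half-normal with un-normalised density \(l(x) = e^{-x^2/2}\) on \(x \geq 0\) and the proposal \(g\) is the Exponential(1) density \(e^{-x}\). The ratio \(l(x)/g(x) = e^{x - x^2/2}\) is largest at \(x = 1\), so \(M = e^{1/2}\) satisfies the bound.

```python
import math
import random

def rejection_sample(unnorm_target, proposal_draw, proposal_density, M, rng):
    """Draw one sample from the target using proposals from g."""
    while True:
        x = proposal_draw(rng)
        r = unnorm_target(x) / (M * proposal_density(x))
        if rng.random() < r:           # accept X with probability r
            return x

rng = random.Random(0)
M = math.exp(0.5)
draws = [
    rejection_sample(
        lambda x: math.exp(-0.5 * x * x),   # l(x), un-normalised half-normal
        lambda r: r.expovariate(1.0),       # X ~ g = Exponential(1)
        lambda x: math.exp(-x),             # g(x)
        M,
        rng,
    )
    for _ in range(20000)
]
draws_mean = sum(draws) / len(draws)
```

The half-normal has mean \(\sqrt{2/\pi} \approx 0.798\), which gives a quick sanity check on `draws_mean`.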

Von Neumann then gives several examples such as how to sample from the exponential distribution by computing \(X = - \log(T)\) with \(T \sim Uni(0,1)\). However he considers it silly to generate a random number and then plug it into a complicated power series to approximate the logarithm function. He therefore gives an alternative way of transforming the random number \(T\) that uses only simple operations rather than the logarithm.

Round table discussion

The conference ended with a round table discussion led by John Tukey which was documented in section 14 of the report. This was a mix of prepared statements as well as discussions between the participants.

To open the discussion, Leonard Savage read from Plutarch’s Lives; in particular a passage that talks about the siege of Syracuse where Archimedes built machines to defend the city. The passage discusses how Archimedes would not have built these applied machines if the king hadn’t asked him to; these machines were the “mere holiday sport of a geometrician”. Savage then reads about Archytas and Eudoxus - friends of Plato - who developed mathematical mechanics. Plato was apparently not happy about this development as:

[mechanics was] destroying the real excellence of geometry by making it leave the region of pure intellect and come within that of the senses and become mixed up with bodies which require much base servile labor.

As a result mechanics was separated from geometry and considered among the military arts. These passages illustrate how applied research and engineering have always been looked down upon by many theoretical researchers. In 1949 Monte Carlo methods and scientific computing were just getting developed so this would indeed have been the case.

Another interesting portion of this discussion was given by John Wishart: he pointed out that he was impressed by the interactions between physicists, mathematicians, and statisticians. He considered that the different groups would be able to learn from each other. He also gave stories of how Karl Pearson was very practically minded and would regularly use his “hand computing machine” to solve integrals which would help with his research. Pearson and his student Leonard Tippett also generated long lists of random numbers to use in research: these lists would allow them to estimate the sampling distribution of some statistics they were studying.

The rest of the discussion goes over different practical problems and the benefits of having interactions between statistics and physics.

Thoughts on this conference

There seemed to be a strong focus in the conference on practical calculations and experiments. The reading by Leonard Savage at the beginning of the round table discussion seems to reflect the general tone of the conference of being equally comfortable dealing with data and computers as well as maths and theory. Indeed computers at the time were unreliable and regularly broke down so researchers had to be very comfortable with engineering. Von Neumann’s remarks on pseudo random numbers also show a very practical mindset of using “whatever works” rather than trying to find the “perfect” random number generator.

I also noticed that throughout the talks there was very little mention of Monte Carlo error. Researchers in the 40s had known for a long time about the central limit theorem (CLT), but I didn’t find any explicit link between a Monte Carlo estimate (which is simply an average of samples) and the CLT which would have given them an error bar (perhaps this was too obvious to mention?). The main mention of Monte Carlo error I found was by Alston Householder who - in his talk - gives estimates along with the standard errors (page 8). The only other hint of this that I found is the sentence by Ted Harris in the discussion at the end of the conference where he is talking about two different ways of obtaining Monte Carlo samples to estimate some quantity:

… we know that if we do use the second estimate instead of the first and if we do it for long enough then we will come out with exactly the right answer. I am leaving aside the question of the relative variability of the two estimates.

My guess is that either the CLT was too obvious to mention, or that it was hard enough to simply estimate these quantities without also computing error bars. In conclusion, I would recommend reading through some of the talks in the report, in particular the round table discussion.
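For reference, the link between a Monte Carlo average and the CLT takes only a few lines. The sketch below (my own toy example, estimating \(E[U^2] = 1/3\) for \(U\) uniform on \((0,1)\)) returns the estimate together with its standard error, assuming independent samples.

```python
import math
import random

def monte_carlo_estimate(samples):
    """Sample mean and its CLT standard error, assuming independent samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    std_error = math.sqrt(var / n)
    return mean, std_error

rng = random.Random(0)
values = [rng.random() ** 2 for _ in range(10000)]
estimate, std_error = monte_carlo_estimate(values)
```

An approximate 95% error bar is then `estimate` plus or minus `1.96 * std_error`.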

Conclusion

We saw how Monte Carlo methods took off in the 1940s and that a lot of these new methods and applications were presented in the 1949 conference. In my next post I’ll go over the classic 1953 paper on the Metropolis sampler and give some details of the MANIAC computer.

Thanks to Gael Martin and Christian Robert for useful feedback on this post

Ensemble samplers can sometimes work in high dimensions

2021-02-26T08:00:00+00:00

A few years ago Bob Carpenter wrote a fantastic post on why ensemble methods cannot work in high dimensions. He explains how, in high dimensions, proposals that interpolate and extrapolate among samples are unlikely to fall in the typical set, causing the sampler to fail. During my PhD I worked on sampling problems with infinite-dimensional parameters (ie: functions) and intractable gradients. After reading Bob’s post I always wondered if there was a way around this problem.

I finally got round to working on this idea and wrote a paper (arXiv) with Robert J Webber about an ensemble method for function spaces (ie: infinite dimensional parameters) which works very well. This sampler is aimed at sampling functions when gradients are unavailable (due to a black-box code base or discontinuous likelihoods for example).

So does that mean that Bob was wrong about ensemble samplers? Of course not; it rather turns out that not all high dimensional distributions are the same. Namely: there is sometimes a low-dimensional subset of parameter space that represents all the “interesting bits” of the posterior that you can focus on.

After giving a problem to motivate function space samplers, I’ll introduce the functional ensemble sampler (FES). I’ll then discuss what this means for using gradient-free samplers such as ensemble samplers in high dimensional spaces. You can skip directly to the discussion section for a reply to Bob’s post, which can be read independently of the description of the FES algorithm.

Functional ensemble sampler

A motivational problem

Consider the 1D advection equation, a linear hyperbolic PDE. This PDE models how a quantity - such as the density of fluid - propagates through a 1D domain over time. This PDE is a special case of more complicated nonlinear PDEs that arise in fluid dynamics, acoustics, motorway traffic flow, and many more applications. The equation is given as follows (with subscripts denoting partial differentiation):

\[\rho_t + c \rho_x = 0\]

Here \(\rho\) is density and \(c \in \mathbb{R}\) is the wave speed of the fluid. We define the initial condition \(\rho_0 \equiv \rho_0(x)\) to be the state of density of the fluid at time \(t=0\). This linear PDE simply advects the initial condition to the right or left of the domain depending on the sign of \(c\).

So the solution can be written as \(\rho(x,t) = \rho_0(x-ct)\). The following figure shows how an initial density profile is advected (ie: transported) to the right with wavespeed \(c=1\).

Figure 1: The left-most panel shows the initial condition. The next 2 panels show it being transported to the right with speed c=1

An inverse problem might then be: given some noisy observations of flow at a few locations (these could correspond to detectors along a pipe for example), recover the initial condition as well as the wave speed of the fluid. Note that flow is the product of density and speed: \(q = \rho c\). Using this relation and the solution of the PDE above, we can write the data model for a detector at location \(x\) measuring flow at time \(t\):

\[q(x, t) = c\rho_0(x-ct) + \xi\]

with \(\xi \sim \mathcal{N}(0,1)\) the observational noise.
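The solution and the data model translate directly into code. The sketch below is only illustrative: the Gaussian bump used as initial condition, the detector locations, and the function names are my own assumptions rather than the setup from the paper.

```python
import math
import random

def advect(rho0, c, x, t):
    """Solution of rho_t + c * rho_x = 0: the initial profile shifted by c * t."""
    return rho0(x - c * t)

def observe_flow(rho0, c, x, t, rng, noise_sd=1.0):
    """Noisy flow measurement q = c * rho0(x - c * t) + xi."""
    return c * advect(rho0, c, x, t) + rng.gauss(0.0, noise_sd)

def bump(x):
    """Hypothetical initial density: a Gaussian bump centred at x = 0.5."""
    return math.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)

rng = random.Random(0)
detectors = [1.0, 2.0, 3.0]
data = [observe_flow(bump, 1.0, x, t=1.0, rng=rng) for x in detectors]
```

Setting `noise_sd=0.0` recovers the noise-free flow, which is handy for checking the forward model.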

To solve this inverse problem, we need to set a prior on both parameters, so we choose a uniform prior for the wave speed, and a Gaussian Process (GP) prior for the initial condition.

Sampling from function spaces

pCN

The most basic gradient free sampler defined on function space is preconditioned Crank-Nicolson (pCN) (see this paper for an overview). This sampler makes the following simple proposal, with \(u\) the current MCMC sample, \(\beta \in (0,1)\) and \(\xi \sim \mathcal{N}(0, \Sigma_0)\) a sample from the prior:

\[\tilde{u} = \sqrt{1-\beta^2}u + \beta \xi\]

The acceptance rate for this sampler is independent of dimension so it is well suited for sampling functions (though in practice they’re discretisations of functions). However this sampler can mix slowly if the posterior is very different from the prior, for example if some of the components of the function are very correlated or multimodal.
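A single pCN step can be sketched as follows for a discretised function stored as a list. The acceptance probability \(\min(1, \exp(\Phi(u) - \Phi(\tilde{u})))\), with \(\Phi\) the negative log-likelihood, is the standard pCN rule and is not spelled out above; the function names and the toy example (an identity prior covariance and one noisy observation of the first component) are assumptions for illustration.

```python
import math
import random

def pcn_step(u, neg_log_lik, sample_prior, beta, rng):
    """One pCN update; returns the new state and whether the move was accepted."""
    xi = sample_prior(rng)                      # xi ~ N(0, Sigma_0)
    scale = math.sqrt(1.0 - beta ** 2)
    proposal = [scale * ui + beta * xii for ui, xii in zip(u, xi)]
    # The prior terms cancel, so only the likelihood enters the ratio.
    log_alpha = neg_log_lik(u) - neg_log_lik(proposal)
    if math.log(1.0 - rng.random()) < log_alpha:
        return proposal, True
    return u, False

def run_pcn(n_steps, beta, rng):
    """Toy run: prior N(0, I) in 3 dimensions, observation y = 1 of u[0], unit noise."""
    sample_prior = lambda r: [r.gauss(0.0, 1.0) for _ in range(3)]
    neg_log_lik = lambda u: 0.5 * (1.0 - u[0]) ** 2
    u, trace = [0.0, 0.0, 0.0], []
    for _ in range(n_steps):
        u, _ = pcn_step(u, neg_log_lik, sample_prior, beta, rng)
        trace.append(u[0])
    return trace
```

In this toy problem the posterior mean of the first component is \(0.5\), so the average of the trace from `run_pcn` gives a quick check that the sampler targets the right distribution.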

FES

The idea of the functional ensemble sampler is to do an eigenfunction expansion of the prior and use that to isolate a low-dimensional subspace that includes the “difficult bit” of the posterior. This means that we can represent functions \(u\) in this space by truncating the eigenexpansion to \(M\) basis elements: \(u_+ = \sum_{j=1}^{M} u_j \phi_j\). This subspace might have very correlated components, be nonlinear, and more generally be difficult to sample from.

Functions in the rest of the space (ie: the complementary subspace) can be represented by \(u_- = \sum_{j=M+1}^{\infty} u_j \phi_j\) and are assumed to look like the prior. We can therefore alternate between using a finite dimensional sampler (we’ll use the affine invariant ensemble sampler, AIES) to sample from the low-dimensional subspace, and using pCN to sample from the complementary subspace.

So the functional ensemble sampler is a Metropolis-within-Gibbs algorithm that alternates sampling from the low dimensional space using AIES and the complementary space using pCN. You can find more detail of the algorithm in the paper (see algorithm 1).
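As an illustration of the split into \(u_+\) and \(u_-\), here is a small NumPy sketch for a discretised function. It assumes a squared-exponential prior covariance on a grid of 200 points (the kernel and its length scale are assumptions, not taken from the paper) and uses the eigenvectors of that matrix in place of the eigenfunctions \(\phi_j\).

```python
import numpy as np

# discretise [0, 1] and build a squared-exponential prior covariance (an assumed kernel)
x = np.linspace(0, 1, 200)
Sigma_0 = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)

# eigenvectors of the prior covariance play the role of the basis functions phi_j
eigvals, eigvecs = np.linalg.eigh(Sigma_0)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
phi = eigvecs[:, order]

M = 10                                     # truncation parameter
rng = np.random.default_rng(0)
u = rng.standard_normal(200)               # any discretised function

coeffs = phi.T @ u                         # coefficients u_j in the eigenbasis
u_plus = phi[:, :M] @ coeffs[:M]           # low-dimensional part (sampled with AIES)
u_minus = phi[:, M:] @ coeffs[M:]          # complementary part (sampled with pCN)
```

The two pieces are orthogonal and add back up to `u`, so updating them in turn is a Gibbs-style sweep over the whole function.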

Performance

We go back to the advection equation and try out the algorithm. Given the equation \(\rho_t + c\rho_x = 0\), the inverse problem consists of inferring the wave speed \(c\) and initial conditions \(\rho_0\) from 9 noisy observations of flow \(q\). These observations come from equally spaced detectors at several time points. We discretise the initial condition \(\rho_0\) using 200 equally spaced points, which we note is a dimension where ensemble methods would usually fail.

We compare FES to a standard pCN sampler which uses the following joint update:

Gaussian proposals for the wave speed \(c\)
pCN for the initial condition \(\rho_0(x)\)

We run the samplers and find that the ensemble sampler can be up to two orders of magnitude faster than pCN (in terms of IATs). We also find that there’s an optimal value of the truncation parameter \(M\) to choose to get the fastest possible mixing. You can find details of this study in the paper in section 4.1.

To understand why the posterior was so challenging for a simple random walk, we plot in figure 2 samples from \(\rho_0\) conditioned on three values of the wave speed \(c\).

Figure 2: Samples of the initial conditions given three values of the wave speed c.

This figure reveals the strong negative correlation between the wave speed and the mean of \(\rho_0\). This correlation between the two parameters can be understood from the solution of the PDE for flow as described earlier: \(q(x, t) = c\rho_0(x-ct)\). If the wave speed \(c\) increases, the mean of \(\rho_0\) must decrease to keep flow at the detectors approximately constant (and vice-versa). Since the pCN sampler does not account for this correlation structure, large pCN updates are highly unlikely to be accepted and the sampler is slow. In contrast, FES adapts to the correlation structure, eliminating the major bottleneck in the sampling.

Discussion

Bob Carpenter’s explanation of why ensemble methods fail in high dimensions is of course correct: ensemble methods will fail in high dimensions because interpolating and extrapolating between points will fall outside the typical set. However high dimensional spaces are not all high dimensional in the same way.

Throughout the post Bob uses a high dimensional Gaussian (ie: a doughnut) to illustrate when ensemble methods fail. Indeed, we statisticians often use the Gaussian distribution to illustrate ideas or simplify reasoning. For example, a standard line of thinking when developing a new method might be to argue: “in the case of everything being Gaussian our method becomes exact..”. This makes sense because Gaussians are easy to manipulate and pop up everywhere because of the central limit theorem.

However, it can be unhelpful to build our mental models of difficult concepts - such as high dimensional distributions - solely on Gaussians. Indeed, difficult sampling problems are hard precisely because they are not Gaussian. This is similar to the problem of using model organisms in biology to learn about humans. So in a way, Gaussians are the fruit flies of statistics.

High-dimensional distributions

There are many ways in which high dimensional distributions might be different from a spherical Gaussian which we can use to help with sampling. One way is that there might exist a low dimensional subspace that represents most of the “interesting bits” of the posterior. The rest of the space (the complementary subspace) is then relatively unaffected by the likelihood and therefore acts like the prior. This is similar to the idea behind PCA where the data mainly lives on a low dimensional subspace.

In our inverse problem involving functional parameters, the prior gives a natural way to find this low dimensional subspace. This is because the Gaussian process prior imposes a lot of structure on the parameter which allows the infinite dimensional problem to be well posed.

To be perfectly clear: if you have gradients, you should use them! In the function space setting there are many gradient-based methods that will do much better than this ensemble method, such as function space versions of HMC and MALA as well as dimensionality reduction methods. However in our paper we rather focused on the setting where gradients were unavailable. So gradient-free samplers are about doing the best you can with limited information.

Conclusion

Making MCMC work is about finding a good parametrisation. HMC uses Hamilton’s equations to find a natural parametrisation (ie: level sets). However, this can still break down with difficult geometries and another reparametrisation might be necessary.

If you don’t have gradients then ensemble methods can be a good solution (example: the emcee package). However if the dimension is too high then these will break down as discussed in Bob Carpenter’s post. Thankfully, this is not the end of the road for ensemble samplers! You’ll then need to think about your problem to identify natural groupings of parameters to apply your gradient-free sampler to. In this post I went over a practical way to do this in the case of infinite-dimensional inverse problems, yielding a simple but powerful gradient-free ensemble sampler defined on function spaces.

Thanks to Robert J Webber and Bob Carpenter for useful feedback on this post

How to add a progress bar to JAX scans and loops

2021-01-29T08:00:00+00:00

JAX allows you to write optimisers and samplers which are really fast if you use the scan or fori_loop functions. However if you write them in this way it’s not obvious how to add a progress bar for your algorithm. This post explains how to make a progress bar using Python’s print function as well as using tqdm. After briefly setting up the sampler, we first go over how to create a basic version using Python’s print function, and then show how to create a nicer version using tqdm. You can find the code for the basic version here and the code for the tqdm version here.

Update January 2023: this is now available in a pip-installable package: JAX-tqdm

Setup: sampling a Gaussian

We’ll use an Unadjusted Langevin Algorithm (ULA) to sample from a Gaussian to illustrate how to write the progress bar. Let’s start by defining the log-posterior of a d-dimensional Gaussian, and we’ll use JAX to get its gradient:

from functools import partial
import jax.numpy as jnp
from jax import grad, jit, lax, random

@jit
def log_posterior(x):
    return -0.5*jnp.dot(x,x)

grad_log_post = jit(grad(log_posterior))

We now define ULA using the scan function (see this post for an explanation of the scan function).

@partial(jit, static_argnums=(2,))
def ula_kernel(key, param, grad_log_post, dt):
    key, subkey = random.split(key)
    paramGrad = grad_log_post(param)
    noise_term  = jnp.sqrt(2*dt)*random.normal(key=subkey, shape=(param.shape))
    param = param + dt*paramGrad + noise_term
    return key, param


@partial(jit, static_argnums=(1,2,))
def ula_sampler(key, grad_log_post, num_samples, dt, x_0):

    def ula_step(carry, iter_num):
        key, param = carry
        key, param = ula_kernel(key, param, grad_log_post, dt)
        return (key, param), param

    carry = (key, x_0)
    _, samples = lax.scan(ula_step, carry, jnp.arange(num_samples))
    return samples

If we add a print function in ula_step above, it will only run the first time ula_sampler is called, which is when the function is traced and compiled. This is because printing is a side effect, and compiled JAX functions are pure.
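A small self-contained sketch of this behaviour, using a toy function rather than the sampler: the side effect here is appending to a Python list (`trace_log` is an illustrative name), which makes it easy to count how often the function body actually runs as Python.

```python
import jax.numpy as jnp
from jax import jit

trace_log = []   # Python-side record of how many times the function body runs

@jit
def double(x):
    trace_log.append("traced")   # side effect: only happens while JAX traces the function
    return 2 * x

double(jnp.ones(3))   # first call: traced and compiled, so the side effect runs
double(jnp.ones(3))   # second call: reuses the compiled code, no side effect
num_traces = len(trace_log)
```

After both calls `num_traces` is 1: the body ran as Python only during tracing. Calling `double` with an input of a different shape would trigger a new trace.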

Basic progress bar

As a workaround, the JAX team has added the host_callback module (which is still experimental, so things may change). This module defines functions that allow you to call Python functions from within a JAX function. Here’s how you would use the id_tap function to create a progress bar (from this discussion):

from jax.experimental import host_callback

def _print_consumer(arg, transform):
    iter_num, num_samples = arg
    print(f"Iteration {iter_num:,} / {num_samples:,}")

@jit
def progress_bar(arg, result):
    """
    Print progress of a scan/loop only if the iteration number is a multiple of the print_rate

    Usage: `carry = progress_bar((iter_num + 1, num_samples, print_rate), carry)`
    Pass in `iter_num + 1` so that counting starts at 1 and ends at `num_samples`

    """
    iter_num, num_samples, print_rate = arg
    result = lax.cond(
        iter_num % print_rate==0,
        lambda _: host_callback.id_tap(_print_consumer, (iter_num, num_samples), result=result),
        lambda _: result,
        operand=None)
    return result

The id_tap function behaves like the identity function, so calling host_callback.id_tap(_print_consumer, (iter_num, num_samples), result=result) will simply return result. However while doing this, it will also call the function _print_consumer((iter_num, num_samples)) which we’ve defined to print the iteration number.

You need to pass an argument in this way because you need to include a data dependency to make sure that the print function gets called at the correct time. This is linked to the fact that computations in JAX are run only when needed. So you need to pass in a variable that changes throughout the algorithm such as the PRNG key at that iteration.

Note also that the _print_consumer function takes in arg (which holds the current iteration number as well as the total number of iterations) and transform. This transform argument isn’t used here, but apparently should be included in the consumer for id_tap (namely: the Python function that gets called).

Here’s how you would use the progress bar in the ULA sampler:

def ula_step(carry, iter_num):
    key, param = carry
    key = progress_bar((iter_num + 1, num_samples, print_rate), key)
    key, param = ula_kernel(key, param, grad_log_post, dt)
    return (key, param), param

We passed the key into the progress bar which comes out unchanged. We also set the print rate to be 10% of the number of samples. Note that this would also work for lax.fori_loop except that the first argument of ula_step would be the current iteration number.
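Since host_callback was experimental when this was written and has been deprecated in later JAX releases, here is a hedged sketch of how the same idea might look with `jax.debug.callback`. This is not the code from the gist: `progress_bar_debug`, `_print_progress` and the toy `step` function are hypothetical names, and the list `printed` is only there so that the behaviour can be checked.

```python
import jax
import jax.numpy as jnp
from jax import lax

num_samples, print_rate = 20, 5
printed = []   # iterations reported so far (kept so the behaviour can be checked)

def _print_progress(iter_num):
    # runs on the host as ordinary Python
    printed.append(int(iter_num))
    print(f"Iteration {int(iter_num):,} / {num_samples:,}")

def progress_bar_debug(iter_num):
    "Hypothetical helper: report progress every `print_rate` iterations"
    lax.cond(
        iter_num % print_rate == 0,
        lambda i: jax.debug.callback(_print_progress, i),
        lambda i: None,
        iter_num,
    )

def step(carry, iter_num):
    progress_bar_debug(iter_num + 1)   # count from 1 to num_samples
    return carry + 1.0, carry

final, _ = lax.scan(step, jnp.zeros(()), jnp.arange(num_samples))
jax.effects_barrier()   # wait for outstanding callbacks to finish
```

With this approach there is no value to thread through the callback: the callback is registered as a side effect, so the data dependency trick is not needed in this sketch.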

Put it in a decorator

We can make this even easier to use by putting the progress bar in a decorator. Note that the decorator takes in num_samples as an argument.

def progress_bar_scan(num_samples):
    def _progress_bar_scan(func):
        print_rate = int(num_samples/10)
        def wrapper_progress_bar(carry, iter_num):
            iter_num = progress_bar((iter_num + 1, num_samples, print_rate), iter_num)
            return func(carry, iter_num)
        return wrapper_progress_bar
    return _progress_bar_scan

Remember that writing a decorator with arguments means writing a function that returns a decorator (which itself is a function that returns a modified version of the main function you care about). See this StackOverflow question about this.
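As a reminder of the pattern, here is a plain Python sketch that has nothing to do with JAX: `repeat` and `greet` are made-up names. The outer function takes the argument, the middle function is the decorator itself, and the inner function is the modified version of the function being decorated.

```python
import functools

def repeat(num_times):
    "Outer function: takes the arguments and returns the actual decorator"
    def decorator(func):
        "The decorator: takes the function and returns a modified version"
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return [func(*args, **kwargs) for _ in range(num_times)]
        return wrapper
    return decorator

@repeat(num_times=3)
def greet(name):
    return f"hello {name}"

greetings = greet("scan")
```

Here `progress_bar_scan` plays the role of `repeat`, `_progress_bar_scan` the role of `decorator`, and `wrapper_progress_bar` the role of `wrapper`.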

Putting it all together, the result is very easy to use:

@partial(jit, static_argnums=(1,2,3))
def ula_sampler_pbar(key, grad_log_post, num_samples, dt, x_0):
    "ULA sampler with progress bar"

    @progress_bar_scan(num_samples)
    def ula_step(carry, iter_num):
        key, param = carry
        key, param = ula_kernel(key, param, grad_log_post, dt)
        return (key, param), param

    carry = (key, x_0)
    _, samples = lax.scan(ula_step, carry, jnp.arange(num_samples))
    return samples

Now that we have a progress bar, we might also want to know when the function is compiling (which is especially useful when it takes a while to compile). Here we can use the fact that the print function only gets called during compilation. We can add print("Compiling..") at the beginning of ula_sampler_pbar and add print("Running:") at the end. Both of these will then only display when the function is first run. You can find the code for this sampler here.

tqdm progress bar

We’ll now use the same ideas to build a fancier progress bar: namely one that uses tqdm. We’ll need to use host_callback.id_tap to define a tqdm progress bar and then call tqdm.update regularly to update it. We’ll also need to close the progress bar once we’re finished or else tqdm will act weirdly. To do this we’ll define a decorator that takes in arguments just like we did in the case of the simple progress bar.

This decorator defines the tqdm progress bar at the first iteration, updates it every print_rate number of iterations, and finally closes it at the end. You can optionally pass in a message to add at the beginning of the progress bar.

There are details to make sure the progress bar acts correctly in corner cases, such as if num_samples is less than 20, or if it’s not a multiple of 20. Note also that tqdm is closed at the last iteration only after the parameter update is done.

from tqdm.auto import tqdm

def progress_bar_scan(num_samples, message=None):
    "Progress bar for a JAX scan"
    if message is None:
        message = f"Running for {num_samples:,} iterations"
    tqdm_bars = {}

    if num_samples > 20:
        print_rate = int(num_samples / 20)
    else:
        print_rate = 1 # if you run the sampler for less than 20 iterations
    remainder = num_samples % print_rate

    def _define_tqdm(arg, transform):
        tqdm_bars[0] = tqdm(range(num_samples))
        tqdm_bars[0].set_description(message, refresh=False)

    def _update_tqdm(arg, transform):
        tqdm_bars[0].update(arg)

    def _update_progress_bar(iter_num):
        "Updates tqdm progress bar of a JAX scan or loop"
        _ = lax.cond(
            iter_num == 0,
            lambda _: host_callback.id_tap(_define_tqdm, None, result=iter_num),
            lambda _: iter_num,
            operand=None,
        )

        _ = lax.cond(
            # update tqdm every multiple of `print_rate` except at the end
            (iter_num % print_rate == 0) & (iter_num != num_samples-remainder),
            lambda _: host_callback.id_tap(_update_tqdm, print_rate, result=iter_num),
            lambda _: iter_num,
            operand=None,
        )

        _ = lax.cond(
            # update tqdm by `remainder`
            iter_num == num_samples-remainder,
            lambda _: host_callback.id_tap(_update_tqdm, remainder, result=iter_num),
            lambda _: iter_num,
            operand=None,
        )

    def _close_tqdm(arg, transform):
        tqdm_bars[0].close()

    def close_tqdm(result, iter_num):
        return lax.cond(
            iter_num == num_samples-1,
            lambda _: host_callback.id_tap(_close_tqdm, None, result=result),
            lambda _: result,
            operand=None,
        )


    def _progress_bar_scan(func):
        """Decorator that adds a progress bar to `body_fun` used in `lax.scan`.
        Note that `body_fun` must either be looping over `np.arange(num_samples)`,
        or be looping over a tuple whose first element is `np.arange(num_samples)`
        This means that `iter_num` is the current iteration number
        """

        def wrapper_progress_bar(carry, x):
            if type(x) is tuple:
                iter_num, *_ = x
            else:
                iter_num = x
            _update_progress_bar(iter_num)
            result = func(carry, x)
            return close_tqdm(result, iter_num)

        return wrapper_progress_bar

    return _progress_bar_scan
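To check the corner cases without running JAX or tqdm, the update rule of the decorator above can be replayed in plain Python. This is only a sketch: `tqdm_update_schedule` is a hypothetical helper that records the increments that would be passed to tqdm's update, using the same `print_rate` and `remainder` logic.

```python
def tqdm_update_schedule(num_samples):
    "Replays the update rule of the decorator above in plain Python (no JAX, no tqdm)"
    print_rate = int(num_samples / 20) if num_samples > 20 else 1
    remainder = num_samples % print_rate
    updates = []
    for iter_num in range(num_samples):
        if iter_num % print_rate == 0 and iter_num != num_samples - remainder:
            updates.append(print_rate)       # regular update
        if iter_num == num_samples - remainder:
            updates.append(remainder)        # final, smaller update
    return updates

# fewer than 20 samples, exactly 20, a multiple of 20, and two non-multiples
totals = {n: sum(tqdm_update_schedule(n)) for n in (7, 20, 100, 103, 1234)}
```

In each case the increments add up to `num_samples`, so the bar finishes exactly at 100%.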

Although this progress bar is more complicated than the previous one, you use it in exactly the same way. You simply add the decorator to the step function used in lax.scan with the number of samples as argument (and optionally the message to print at the beginning of the progress bar).

@partial(jit, static_argnums=(1,2))
def ula_sampler_pbar(key, grad_log_post, num_samples, dt, x_0):
    "ULA sampler with progress bar"

    @progress_bar_scan(num_samples)
    def ula_step(carry, iter_num):
        key, param = carry
        key, param = ula_kernel(key, param, grad_log_post, dt)
        return (key, param), param

    carry = (key, x_0)
    _, samples = lax.scan(ula_step, carry, jnp.arange(num_samples))
    return samples

Conclusion

So we’ve built two progress bars: a basic version and a nicer version that uses tqdm. The code for both is in these two gists: here and here.

MCMC in JAX with benchmarks: 3 ways to write a sampler

2020-11-10T08:00:00+00:00

This post goes over 3 ways to write a sampler using JAX. I found that although there are a bunch of tutorials about learning the basics of JAX, it was not clear to me what was the best way to write a sampler in JAX. In particular, how much of the sampler should you write in JAX? Just the log-posterior (or the loss in the case of optimisation), or the entire loop? This blog post tries to answer this by going over 3 ways to write a sampler while focusing on the speed of each sampler.

I’ll assume that you already know some JAX, in particular the functions grad, vmap, and jit, along with the random number generator. If not, you can check out how to use these in this blog post or in the JAX documentation! I will rather focus on the different ways of using JAX for sampling (using the ULA sampler) and the speed performance of each implementation. I’ll then redo these benchmarks for 2 other samplers (MALA and SGLD). The benchmarks are done on both CPU (in the post) and GPU (in the appendix) for comparison. You can find the code to reproduce all these examples on Github.

Sampler and model

To benchmark the samplers we’ll use Bayesian logistic regression throughout. As sampler we’ll start with the unadjusted Langevin algorithm (ULA) with Euler discretisation, as it is one of the simplest gradient-based samplers out there since it has no accept-reject step. Let \(\theta_n \in \mathcal{R}^d\) be the parameter at iteration \(n\), \(\nabla \log \pi(\theta)\) the gradient of the log-posterior, \(dt\) the step size, and \(\xi \sim \mathcal{N}(0, I_d)\). Given a current position of the chain, the next sample is given by the equation:

\[\theta_{n+1} = \theta_n + dt\nabla\log\pi(\theta_n) + \sqrt{2dt}\xi\]

The setup of the logistic regression model is the same as the one from this SG-MCMC review paper:

Matrix of covariates \(\textbf{X} \in \mathcal{R}^{N\times d}\), and vector of responses: \(\textbf{y} = \{ y_i \}_1^N\)
Parameters: \(\theta \in \mathcal{R}^d\)

Model:

\(y_i \sim \text{Bernoulli}(p_i)\) with \(p_i = \frac{1}{ 1+\exp(-\theta^T x_i)}\)
Prior: \(\theta \sim \mathcal{N}(0, \Sigma_{\theta})\) with \(\Sigma_{\theta} = 10\textbf{I}_d\)
Likelihood: \(p(X,y \mid \theta) = \prod_{i=1}^N p_i^{y_i}(1-p_i)^{1-y_i}\)
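For concreteness, here is one way this model could be written in JAX. It is a sketch and not necessarily the code in the Github repo: the function names and the toy data are made up, and the data is passed in explicitly as arguments (for the ULA sampler in this post you would close over `X` and `y` so that the gradient function takes only the parameter).

```python
import jax.numpy as jnp
from jax import grad, jit, vmap
from jax.nn import log_sigmoid

def log_lik_single(theta, x_i, y_i):
    "Bernoulli log-likelihood of one data point: y*log(p) + (1-y)*log(1-p)"
    logit = jnp.dot(theta, x_i)
    return y_i * log_sigmoid(logit) + (1 - y_i) * log_sigmoid(-logit)

# vectorise over the data points (rows of X, entries of y); theta is shared
batch_log_lik = vmap(log_lik_single, in_axes=(None, 0, 0))

def log_prior(theta):
    "N(0, 10 I) prior, up to an additive constant"
    return -0.5 * jnp.dot(theta, theta) / 10.0

@jit
def log_posterior(theta, X, y):
    return log_prior(theta) + jnp.sum(batch_log_lik(theta, X, y))

grad_log_post = jit(grad(log_posterior))   # gradient with respect to theta

# toy data: 4 points in 2 dimensions
X = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]])
y = jnp.array([1.0, 0.0, 1.0, 0.0])
theta_0 = jnp.zeros(2)
lp = log_posterior(theta_0, X, y)
g = grad_log_post(theta_0, X, y)
```

Writing \(\log p_i\) and \(\log(1-p_i)\) with `log_sigmoid` avoids taking the log of a probability that has underflowed to zero.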

Version 1: Python loop with JAX for the log-posterior

In this version we only use JAX to write the log-posterior function (or the loss function in the case of optimisation). We use vmap to calculate the log-likelihood for each data point, jit to compile the function, and grad to get the gradient (see the code for the model on Github). The rest of the sampler is a simple Python loop with NumPy to store the samples, as is shown below:

import numpy as np

def ula_sampler_python(grad_log_post, num_samples, dt, x_0, print_rate=500):
    dim, = x_0.shape
    samples = np.zeros((num_samples, dim))
    paramCurrent = x_0

    print("Python sampler:")
    for i in range(num_samples):
        paramGradCurrent = grad_log_post(paramCurrent)
        # ULA update: gradient step plus Gaussian noise
        paramCurrent = (paramCurrent + dt*paramGradCurrent +
                        np.sqrt(2*dt)*np.random.normal(size=(paramCurrent.shape)))
        samples[i] = paramCurrent
        if i%print_rate==0:
            print(f"Iteration {i}/{num_samples}")
    return samples

In this sampler we write the update equation using NumPy and store the samples in the array samples.

Version 2: JAX for the transition kernel

With JAX we can compile functions using jit which makes them run faster (we did this for the log-posterior function). Could we not put the bit inside the loop in a function and compile that? The issue is that for jit to work, you can’t have NumPy arrays or use the NumPy random number generator (np.random.normal()).

JAX does random numbers a bit differently to NumPy. I won’t explain how this bit works; you can read about them in the documentation. The main idea is that jit-compiled JAX functions don’t allow side effects, such as updating a global random state. As a result, you have to explicitly pass in a PRNG key (called key) to every function that includes randomness, and split the key to get different pseudorandom numbers.
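A minimal sketch of what this looks like in practice (the seed and shapes are arbitrary):

```python
from jax import random

key = random.PRNGKey(0)

# the same key always gives the same draw: there is no hidden global state
draw_a = random.normal(key, shape=(3,))
draw_b = random.normal(key, shape=(3,))

# to get fresh randomness, split the key and use the new subkey
key, subkey = random.split(key)
draw_c = random.normal(subkey, shape=(3,))
```

Here `draw_a` and `draw_b` are identical, while `draw_c` is a different draw. This is the same `key, subkey = random.split(key)` pattern used in the transition kernel below.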

Below is a function for the transition kernel of the sampler rewritten to only include JAX functions and arrays (so it can be compiled). The point of the partial decorator and the static_argnums argument is to point to which arguments will not change once the function is compiled. Indeed, the function for the gradient of the log-posterior or the step size will not change throughout the sampler, but the PRNG key and the parameter definitely will! This means that the function will run faster as it can hardcode these static values/functions during compilation. Note that if the argument is a function (as is the case for grad_log_post) you don’t have a choice and must set it as static. See the documentation for info on this.

@partial(jit, static_argnums=(2,3))
def ula_kernel(key, param, grad_log_post, dt):
    key, subkey = random.split(key)
    paramGrad = grad_log_post(param)
    param = param + dt*paramGrad + jnp.sqrt(2*dt)*random.normal(key=subkey, shape=(param.shape))
    return key, param

The main loop in the previous function now becomes:

for i in range(num_samples):
    key, param = ula_kernel(key, param, grad_log_post, dt)
    samples[i] = param
    if i%print_rate==0:
        print(f"Iteration {i}/{num_samples}")

Notice how we split the random key inside the ula_kernel() function, which means it gets compiled (JAX’s random number generator can be slow in some cases). We still save the samples in the NumPy array samples as in the previous case. Running this function several times with the same starting PRNG key will now produce exactly the same samples, which means that the sampler is completely reproducible.

Version 3: full JAX

We’ve written more of our function in JAX, but there is still some Python left. Could we rewrite the entire sampler in JAX? It turns out that we can! JAX does allow us to write loops, but as it is designed to work on pure functions you need to use the scan function. This function allows you to loop over an array (similar to doing for elem in mylist in Python).

The way to use scan is to pass in a function that is called at every iteration. This function takes in carry which contains all the information you use in each iteration (and which you update as you go along). It also takes in x which is the value of the array you’re iterating over. It should return an updated version of carry along with anything whose progress you want to keep track of: in our case, we want to store all the samples as we iterate.

Note that JAX also has a similar fori_loop function which apparently you should only use if you can’t use scan (see the discussion on Github). In the case of our sampler scan is easier to use as you don’t need to explicitly keep track of the entire chain of samples; scan does it for you. In contrast, when using fori_loop you have to pass an array of samples in state which you update yourself as you go along. In terms of performance I did a quick benchmark for both and didn’t see a speed difference in this case, though the discussion on Github says there can be speed benefits.
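To illustrate the difference in bookkeeping, here is a self-contained sketch of the same ULA sampler written with both functions. It targets a standard Gaussian and uses a simplified kernel (the gradient function is a global rather than a static argument), so it is an illustration of the two loop styles rather than the benchmarked code.

```python
import jax.numpy as jnp
from jax import grad, lax, random

grad_log_post = grad(lambda x: -0.5 * jnp.dot(x, x))   # standard Gaussian target

def ula_kernel(key, param, dt):
    key, subkey = random.split(key)
    noise = jnp.sqrt(2 * dt) * random.normal(subkey, shape=param.shape)
    return key, param + dt * grad_log_post(param) + noise

def ula_fori(key, num_samples, dt, x_0):
    "ULA with fori_loop: the array of samples is part of the loop state"
    def body(i, state):
        key, param, samples = state
        key, param = ula_kernel(key, param, dt)
        samples = samples.at[i].set(param)   # store the sample ourselves
        return key, param, samples

    samples = jnp.zeros((num_samples,) + x_0.shape)
    _, _, samples = lax.fori_loop(0, num_samples, body, (key, x_0, samples))
    return samples

def ula_scan(key, num_samples, dt, x_0):
    "Same sampler with scan: the stacked samples come back automatically"
    def step(carry, _):
        key, param = ula_kernel(*carry, dt)
        return (key, param), param

    _, samples = lax.scan(step, (key, x_0), None, length=num_samples)
    return samples

key, x_0 = random.PRNGKey(0), jnp.zeros(5)
samples_fori = ula_fori(key, 100, 1e-2, x_0)
samples_scan = ula_scan(key, 100, 1e-2, x_0)
```

Both versions consume the PRNG key in the same way, so they produce the same chain; the only difference is who is responsible for storing the samples.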

Here is the function that we’ll pass in scan. Note that the first line unpacks carry. The ula_kernel function then generates the new key and parameter. We then return the new version of carry (ie: (key, param)) which includes the updated key and parameter, and return the current parameter (param) which scan will save in an array.

def ula_step(carry, x):
  key, param = carry
  key, param = ula_kernel(key, param, grad_log_post, dt)
  return (key, param), param

You can then pass this function along with the initial state in scan, and recover the final carry along with all the samples. The last two arguments in the scan function below mean that we don’t care what we’re iterating over; we simply want to run the sampler for num_samples number of iterations (as always, see the docs for details).

carry = (key, x_0)
carry, samples = lax.scan(ula_step, carry, None, num_samples)

Putting it all together in a single function, we get the following. Notice that we compile the entire function with grad_log_post, num_samples, and dt kept as static. We allow the PRNG key and the starting point of the chain x_0 to vary so we can get different realisations of our chain.

@partial(jit, static_argnums=(1,2,3))
def ula_sampler_full_jax_jit(key, grad_log_post, num_samples, dt, x_0):

    def ula_step(carry, x):
        key, param = carry
        key, param = ula_kernel(key, param, grad_log_post, dt)
        return (key, param), param

    carry = (key, x_0)
    _, samples = lax.scan(ula_step, carry, None, num_samples)
    return samples

Having the entire function written in JAX means that once the function is compiled it will usually be faster (see benchmarks below), and we can rerun it for different PRNG keys or different initial conditions to get different realisations of the chain. We can also run this function in vmap (mapping over the keys or initial conditions) to get several chains running in parallel. Check out this blog post for a benchmark of a Metropolis sampler in parallel using JAX and Tensorflow.
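As a sketch of the vmap idea, the snippet below runs 4 chains in parallel by mapping over the PRNG keys. To keep it self-contained it uses a standard Gaussian target and fixes `num_samples` and `dt` as globals instead of static arguments; these simplifications are assumptions for the example only.

```python
import jax.numpy as jnp
from jax import grad, jit, lax, random, vmap

grad_log_post = grad(lambda x: -0.5 * jnp.dot(x, x))   # standard Gaussian target
num_samples, dt = 500, 1e-2

@jit
def ula_sampler(key, x_0):
    def ula_step(carry, _):
        key, param = carry
        key, subkey = random.split(key)
        noise = jnp.sqrt(2 * dt) * random.normal(subkey, shape=param.shape)
        param = param + dt * grad_log_post(param) + noise
        return (key, param), param

    _, samples = lax.scan(ula_step, (key, x_0), None, length=num_samples)
    return samples

num_chains = 4
keys = random.split(random.PRNGKey(0), num_chains)      # one PRNG key per chain
x_0 = jnp.zeros(5)

# map over the keys only; every chain starts from the same x_0
run_chains = vmap(ula_sampler, in_axes=(0, None))
all_samples = run_chains(keys, x_0)                     # shape (num_chains, num_samples, dim)
```

To also vary the starting points you would pass an array of initial conditions with a leading chain dimension and use `in_axes=(0, 0)`.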

Note that another way to do this would have been to split the initial key once at the beginning (keys = random.split(key, num_samples)) and scan over (ie: loop over) all these keys: lax.scan(ula_step, carry, keys). The ula_step and ula_kernel functions would then have to be modified slightly for this to work. This would simplify code even more as it means you don’t need to split the key at each iteration anymore.
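Here is a sketch of what that variant might look like, again on a standard Gaussian target with `num_samples` and `dt` fixed as globals for brevity. The modified `ula_kernel` and `ula_step` signatures are one possible choice, not the ones from the repo.

```python
import jax.numpy as jnp
from jax import grad, lax, random

grad_log_post = grad(lambda x: -0.5 * jnp.dot(x, x))   # standard Gaussian target
num_samples, dt = 500, 1e-2

def ula_kernel(key, param):
    "The kernel now consumes a key rather than splitting and returning one"
    noise = jnp.sqrt(2 * dt) * random.normal(key, shape=param.shape)
    return param + dt * grad_log_post(param) + noise

def ula_step(param, key):
    "The carry is just the parameter; `key` is the element of `keys` for this iteration"
    param = ula_kernel(key, param)
    return param, param

keys = random.split(random.PRNGKey(0), num_samples)    # one key per iteration
_, samples = lax.scan(ula_step, jnp.zeros(5), keys)
```

The trade-off is that all the keys are created up front, so an array of `num_samples` keys has to be held in memory.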

The only thing left to do in the full JAX version is to print the progress of the chain, which is especially useful for long runs. This is not as straightforward to do with jitted functions as with standard Python functions, but this discussion on Github goes over how to do this.

The final thing to point out is that this JAX code ports directly to GPU without any modifications. See the appendix for benchmarks on a GPU.

Benchmarks

Now that we’ve gone over 3 ways to write an MCMC sampler we’ll show some speed benchmarks for ULA along with two other algorithms. We use the logistic regression model presented above and run 20 000 samples throughout.

These benchmarks ran on my laptop (a standard MacBook Pro). You can find the benchmarks of the same samplers on a GPU in the appendix.

Unadjusted Langevin algorithm

Increase amount of data:

We run ULA for 20 000 samples for a 5 dimensional parameter. We vary the amount of data used and see how fast the algorithms are (time is in seconds).

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	11	3.4	0.53	0.18
\(10^4\)	11	4.6	2.0	1.6
\(10^5\)	32	32	24	24
\(10^6\)	280	280	250	250

We can see that for small amounts of data the full JAX sampler is much faster than the Python loop. In particular, for 1000 data points the full JAX sampler (once compiled) is about 60 times faster than the Python loop version (11 seconds against 0.18 seconds).

Note that all the samplers use JAX to get the gradient of the log-posterior (including the Python loop version). So the speedup comes from everything else in the sampler being compiled. We also notice that for small amounts of data, there’s a big difference between the first full JAX run (where the function is being compiled) and the second one (where the function is already compiled). This speedup would be especially useful if you need to run the sampler many times for different starting points or realisations (ie: choosing a different PRNG key). We can also see that simply writing the transition kernel in JAX already causes a 3x speedup over the Python loop version.

However as we add more data, the differences between the algorithms get smaller. The full JAX version is still the fastest, but not by much. This is probably because the log-posterior dominates the computational cost of the sampler as the dataset increases. As that function is the same for all samplers, they end up having similar timings.

Increase the dimension:

We now run the samplers with a fixed dataset size of 1000 data points, and run each sampler for 20K iterations while varying the dimension:

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(5\)	11	3.4	0.56	0.19
\(500\)	12	5.0	2.5	1.6
\(1000\)	13	7.7	4.3	3.4
\(2000\)	13	16	14	13

Here the story is similar to above: for small dimensionality the full JAX sampler is 60x faster than the Python loop version. But as you increase the dimension the gap gets smaller. As in the previous case, this is probably because the main effect of increasing the dimensionality is seen in the log-posterior function (which is in JAX for all the samplers).

The only difference to note is that the JAX kernel version is slower than the Python loop version for dimension 2000. Jake VanderPlas suggests that this has to do with moving data around, which has a low overhead for NumPy but can be expensive when JAX and NumPy interact. But in any case this reinforces the idea that you should always benchmark your code to make sure it’s fast.

Stochastic gradient Langevin dynamics (SGLD)

We now try the same experiment with the stochastic gradient Langevin dynamics sampler. This is the same as ULA but calculates gradients based on mini-batches rather than on the full dataset. This makes it suited for applications with very large datasets, but the sampler produces samples that aren’t exactly from the target distribution (often the variance is too high).

The transition kernel below is therefore quite similar to ULA, but randomly chooses minibatches of data to calculate gradients with. Note also that the grad_log_post function includes the minibatch dataset as arguments. Also note that we sample minibatches with replacement (random.choice has replace=True as default). This is because sampling without replacement is very expensive in JAX, so doing this will dramatically slow down the sampler!

@partial(jit, static_argnums=(2,3,4,5,6))
def sgld_kernel(key, param, grad_log_post, dt, X, y_data, minibatch_size):
    N, _ = X.shape
    key, subkey1, subkey2 = random.split(key, 3)
    idx_batch = random.choice(subkey1, N, shape=(minibatch_size,))
    paramGrad = grad_log_post(param, X[idx_batch], y_data[idx_batch])
    param = param + dt*paramGrad + jnp.sqrt(2*dt)*random.normal(key=subkey2, shape=(param.shape))
    return key, param

Increase amount of data:

We run the same experiment as before: 20 000 samples for a 5 dimensional parameter while increasing the amount of data. As before the timings are in seconds. The minibatch sizes we use are \(10\%\), \(10\%\), \(1\%\), and \(0.1\%\) respectively.

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	59	4.2	1.1	0.056
\(10^4\)	60	4.4	3.8	0.9
\(10^5\)	73	3.9	1.6	0.40
\(10^6\)	65	4.0	2.0	0.69

Here we see that unlike in the case of ULA, we keep the large speedup from compiling everything in JAX. This is because the minibatches allow us to keep the cost of the log-posterior low.

We also notice that the Python and JAX kernel versions are slower than their ULA counterparts for low and medium dataset sizes. This is probably due to the cost of sampling the minibatches and the fact that for these small dataset sizes the log-posterior function is efficient enough to not actually need minibatches. However for the last dataset (1 million data points) the benefit of using minibatches becomes clear.

Increase the dimension:

We now run the samplers with a fixed dataset size of 1000 data points, and run each sampler for 20K iterations while varying the dimension. We use as minibatch size 10% of the data for all 4 runs.

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(5\)	61	4.2	1.1	0.055
\(500\)	62	4.2	1.9	0.56
\(1000\)	62	5.0	2.3	0.98
\(2000\)	68	6.4	3.3	1.95

Here the two JAX samplers benefit from using minibatches, while the Python version is slower than its ULA counterpart in all cases.

Metropolis Adjusted Langevin algorithm (MALA)

We now re-run the same experiment but with MALA, which is like ULA but with a Metropolis-Hastings correction to ensure that the samples are unbiased. This correction means that the transition kernel is more computationally expensive:

@partial(jit, static_argnums=(3,5))
def mala_kernel(key, paramCurrent, paramGradCurrent, log_post, logpostCurrent, dt):
    key, subkey1, subkey2 = random.split(key, 3)
    # Langevin proposal
    paramProp = paramCurrent + dt*paramGradCurrent + jnp.sqrt(2*dt)*random.normal(key=subkey1, shape=paramCurrent.shape)
    # log_post returns both the log-posterior and its gradient
    new_log_post, new_grad = log_post(paramProp)

    # log proposal densities (up to a constant):
    # q_new for current -> proposed, q_current for proposed -> current
    term1 = paramProp - paramCurrent - dt*paramGradCurrent
    term2 = paramCurrent - paramProp - dt*new_grad
    q_new = -0.25*(1/dt)*jnp.dot(term1, term1)
    q_current = -0.25*(1/dt)*jnp.dot(term2, term2)

    # Metropolis-Hastings accept/reject step
    log_ratio = new_log_post - logpostCurrent + q_current - q_new
    acceptBool = jnp.log(random.uniform(key=subkey2)) < log_ratio
    paramCurrent = jnp.where(acceptBool, paramProp, paramCurrent)
    current_grad = jnp.where(acceptBool, new_grad, paramGradCurrent)
    current_log_post = jnp.where(acceptBool, new_log_post, logpostCurrent)
    accepts_add = jnp.where(acceptBool, 1,0)
    return key, paramCurrent, current_grad, current_log_post, accepts_add
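
The acceptance ratio in the kernel above can be checked on a toy problem. Below is a minimal pure-Python sketch of the same computation for a one-dimensional standard normal target; the target, the function names, and the step size are assumptions chosen for illustration, not part of the benchmarked code.

```python
import math
import random

def log_post_and_grad(x):
    """Standard normal target: log-density (up to a constant) and its gradient."""
    return -0.5 * x * x, -x

def mala_log_ratio(x_cur, x_prop, dt):
    """Log Metropolis-Hastings ratio for a Langevin proposal with step size dt."""
    lp_cur, grad_cur = log_post_and_grad(x_cur)
    lp_prop, grad_prop = log_post_and_grad(x_prop)
    # log q(x_prop | x_cur) and log q(x_cur | x_prop), up to a shared constant
    q_new = -(x_prop - x_cur - dt * grad_cur) ** 2 / (4 * dt)
    q_current = -(x_cur - x_prop - dt * grad_prop) ** 2 / (4 * dt)
    return lp_prop - lp_cur + q_current - q_new

def mala_step(x_cur, dt, rng):
    """One MALA transition: returns the new state and whether the move was accepted."""
    grad_cur = log_post_and_grad(x_cur)[1]
    x_prop = x_cur + dt * grad_cur + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
    # 1 - random() lies in (0, 1], so the log is always defined
    accept = math.log(1.0 - rng.random()) < mala_log_ratio(x_cur, x_prop, dt)
    return (x_prop, True) if accept else (x_cur, False)

rng = random.Random(0)
x, n_accepted = 0.0, 0
for _ in range(1000):
    x, accepted = mala_step(x, dt=0.5, rng=rng)
    n_accepted += accepted
print("acceptance rate:", n_accepted / 1000)
```

Two quick sanity checks follow from the formula: the log-ratio is zero when the proposal equals the current state, and swapping the roles of the current and proposed states flips its sign.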

We run the usual 3 versions: a Python sampler with JAX for the log-posterior, a Python loop with the JAX transition kernel, and a “full JAX” sampler.

Increase amount of data:

We run each sampler for 20 000 samples for a 5 dimensional parameter while varying the size of the dataset.

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	38	7	0.93	0.19
\(10^4\)	38	7	2.6	1.9
\(10^5\)	56	35	27	26
\(10^6\)	330	310	270	272

The story here is similar to the story in the case of ULA. The main difference is that the speedup for the full JAX sampler is more pronounced in this case (especially for the smaller datasets). Indeed, for 1000 data points the full JAX (once it’s compiled) is 200 times faster than the Python loop version. This is probably because the transition kernel is more complicated and so contributes more to the overall computational cost of the sampler. As a result compiling it brings a larger speed increase than for ULA.

Furthermore (as in the case of ULA), when the dataset size is increased the speeds of the samplers start to converge to the same value.

Increase the dimension:

We now run the samplers with a fixed dataset size of 1000 data points, and run each sampler for 20K iterations while varying the dimension.

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(5\)	39	7.2	0.94	0.20
\(500\)	40	7.2	2.6	1.5
\(1000\)	41	8.0	4.8	3.4
\(2000\)	43	15	14	13

We have a similar story to the case of increasing data: using full JAX speeds up the sampler a lot, but that gap gets smaller as you increase the dimensionality.

Conclusion

We’ve seen that there are different ways to write MCMC samplers by having more or less of the code written in JAX. On one hand, you can use JAX to write the log-posterior function and use Python/NumPy for the rest. On the other hand you can use JAX to write the entire sampler. We’ve also seen that in general the full JAX sampler is faster than the Python loop version, but that this difference gets smaller as the amount of data and dimensionality increase.

The main conclusion we take from this is that in general writing more things in JAX speeds up the code. However you have to make sure it’s well written so you don’t accidentally slow things down (for example by re-compiling a function at every iteration by mis-using static_argnums when jitting). You should therefore always benchmark code and compare different ways of writing it.

All the code for this post is on Github

Thanks to Jake VanderPlas and Remi Louf for useful feedback on this post as well as the High End Computing facility at Lancaster University for the GPU cluster (results in the appendix below)

Appendix: GPU benchmarks

edit: added this section on 9th December 2020

We show here the benchmarks on a single GPU compute node. For the runs where the dataset size increases we run the samplers for 20 000 iterations for a 5 dimensional parameter. For the ones where we increase the dimension we generate 1000 data points and run 20 000 iterations.

Note that here the ranges of dataset sizes and dimensions are much larger as the timings essentially didn’t vary for the ranges used in the previous benchmarks. Also notice how for small dataset sizes and dimensions the samplers are faster on CPU. This is because the GPU has a fixed overhead cost. However as the datasets get larger the GPU does much better.

Timings are all in seconds.

ULA

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	18	8.4	2.2	1.5
\(10^6\)	18	12	5.8	5.2
\(10^7\)	49	50	43	42
\(2*10^7\)	90	92	84	82

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(100\)	18	8.2	2.2	1.5
\(10^4\)	33	10	4.0	3.0
\(2*10^4\)	47	14	6.5	5.0
\(3*10^4\)	61	18	9.1	7.1

SGLD

The minibatch sizes for the increasing dataset sizes are \(10\%\), \(10\%\), \(1\%\), and \(0.1\%\) respectively.

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	80	11	3.6	2.8
\(10^6\)	95	10	3.3	2.9
\(10^7\)	120	10	3.4	3.0
\(2*10^7\)	90	10	3.3	2.9

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(100\)	80	11	3.8	2.9
\(10^4\)	96	12	3.6	3.0
\(2*10^4\)	109	13	3.6	2.9
\(3*10^4\)	122	14	3.6	3.0

MALA

dataset size	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(10^3\)	57	14	3.2	2.4
\(10^6\)	56	14	6.9	5.8
\(10^7\)	83	54	46	44
\(2*10^7\)	126	98	89	86

dimension	python	JAX kernel	full JAX (1st run)	full JAX (2nd run)
\(100\)	57	14	3.6	2.7
\(10^4\)	72	16	5.4	3.6
\(2*10^4\)	88	17	9.4	5.7
\(3*10^4\)	101	19	12	7.8

Implementing natural numbers in OCaml

2020-04-06T08:00:00+00:00

In this post we’re going to implement natural numbers (non-negative integers) in OCaml to see how we can define numbers from first principles, namely without using OCaml’s built-in int type. We’ll then write a simple UI so that we have a basic (but inefficient) calculator. You can find all the code for this post on Github.

Definition

We’ll start with a recursive definition of natural numbers:

\[n \in \mathcal{N} \iff n = \begin{cases}0 \\ S(m) \hspace{5mm} \text{for }m \in \mathcal{N} \end{cases}\]

We used the function \(S(m)\) which is called the successor function. This simply returns the next natural number (for example \(S(0)=1\), and \(S(4)=5\)).

This definition means that a natural number is either \(0\) or the successor of another natural number. For example \(0\) is a natural number (the first case in the definition), but \(1\) is also a natural number, as it’s the successor of \(0\) (you would write \(1=S(0)\)). 2 can then be written as \(2 = S(S(0))\) , and so on. By using recursion (the definition of a natural number includes another natural number) we can “bootstrap” building numbers without using many other definitions.

We now write this definition as a type in OCaml, which looks a lot like the mathematical definition above:

type nat =
  | Zero
  | Succ of nat

The vertical lines denote the two cases. Here you would write 1 as Succ Zero, 2 as Succ (Succ Zero), and so on.

However we haven’t said what these numbers are (what is zero? What are numbers? ). To do that we need to define how they act.

Some operators

We’ll start off by defining how we can increment and decrement them.

let incr n =
  Succ n

let decr n =
  match n with
  | Zero -> Zero
  | Succ nn -> nn

The increment function simply adds a Succ before the number, so this corresponds to adding 1. So incr (Succ Zero) returns Succ (Succ Zero). The decrement function checks whether the number n is Zero or the successor of a number. In the first case it simply returns Zero (so decr Zero returns Zero; however this could be extended to include negative numbers). In the second case the function returns the number that precedes it. So decr (Succ (Succ (Succ Zero))) returns Succ (Succ Zero).

Addition

We can now define addition as a recursive function which we denote by ++ (in OCaml we define infix operators using parentheses). So the addition function takes two elements n and m of type nat and returns an element of type nat. Note the rec added before the function name which means that it’s recursive.

let rec (++) n m =
  match m with
  | Zero -> n
  | Succ mm -> (Succ n) ++ mm

Because we defined the function to be an infix operator we put it in between the arguments (ex: Zero ++ (Succ Zero)). This function checks whether m is Zero or the successor of a number. If it’s a successor of mm it returns the sum of mm and Succ n.

Let’s check that this definition behaves correctly by calculating 1+1 which we write as (Succ Zero) ++ (Succ Zero). The first call to the function finds that the second argument is the successor of Zero, so returns the sum (Succ (Succ Zero)) ++ Zero. This calls the function a second time which finds that the second argument is Zero. As a result the function returns Succ (Succ Zero) which is 2!

So in summary 1+1 is written as (Succ Zero) ++ (Succ Zero) = (Succ (Succ Zero)) ++ Zero = Succ (Succ Zero). Math still works!

Subtraction

We now define subtraction:

let rec (--) n m =
  match m with
  | Zero -> n
  | Succ mm -> (decr n) -- mm

This decrements both arguments until the second one is Zero. Note that if m is bigger than n then n -- m will still equal Zero.

Multiplication

Moving on, we define multiplication:

let (+*) n m =
  let rec aux n m acc =
    match m with
    | Zero -> acc
    | Succ mm -> aux n mm (n ++ acc)
  in
  aux n m Zero

Here we use an auxiliary function (aux) which builds up the result in the accumulator acc by adding n to it m times. So applying this function to \(3\) and \(2\) gives: \(3*2 = 3*1 + 3 = 3*0 + 6 = 6\). And in code this is:

(Succ (Succ (Succ Zero))) +* (Succ (Succ Zero))
Which returns ((Succ (Succ (Succ Zero))) +* (Succ Zero)) ++ (Succ (Succ (Succ Zero)))
Which returns ((Succ (Succ (Succ Zero))) +* Zero) ++ (Succ (Succ (Succ (Succ (Succ (Succ Zero))))))
which returns (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))) (namely \(6\))

Division

We also define the ‘strictly less than’ operator which we then use to define integer division.

let rec (<<) n m =
  match (n, m) with
  | (p, Zero) -> false
  | (Zero, q) -> true
  | (p, q) -> (decr n) << (decr m)

let (//) n m =
  let rec aux p acc =
    let lt = p << m in
    match lt with
    | true -> acc
    | false -> aux (p -- m) (Succ acc)
  in
  aux n Zero

Like in the case of multiplication, the division function defines an auxiliary function that builds up the result in the accumulator acc. This function checks whether the first argument p is less than m. If it isn’t, then increment the accumulator by 1 and call aux again but with p-m as the first argument. Once p is less than m then return the accumulator. So this auxiliary function counts the number of times that m fits into p, which is exactly what integer division is. We run this function with n as first argument and with the accumulator as Zero.

Finally we can define the modulo operator. As we use previous definitions of division, multiplication, and subtraction, this definition is abstracted away from our implementation of natural numbers. This function gives the remainder when dividing n by m.

let (%) n m =
  let p = n // m in
  n -- (p +* m)

A basic UI

We’ve defined the natural numbers and the basic operators, but it’s a bit unwieldy to use them in their current form. So we’ll write some code to convert them to the usual number system (represented as strings) and back.

From type `nat` to string representation

We’ll write some code to convert numbers to base 10 and then represent them in the usual Arabic numerals.

let ten = Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))))))

let base10 n =
  let rec aux q acc =
    let r = q % ten in
    let p = q // ten in
    match p with
    | Zero -> r::acc
    | pp -> aux p (r::acc)
  in
  aux n []

This function returns a list of base 10 digits with the most significant digit first (so the last element is the number of 1s, the one before it the number of 10s, and so on). So if n is Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero))))))))))) (ie: 12), then base10 n returns [Succ Zero; Succ (Succ Zero)].

We then define the 10 digits (with a hack for the cases bigger than 9) and put it all together in the function string_of_nat.

let print_nat_digits = function
  | Zero -> "0"
  | Succ Zero -> "1"
  | Succ (Succ Zero) -> "2"
  | Succ (Succ (Succ Zero)) -> "3"
  | Succ (Succ (Succ (Succ Zero))) -> "4"
  | Succ (Succ (Succ (Succ (Succ Zero)))) -> "5"
  | Succ (Succ (Succ (Succ (Succ (Succ Zero))))) -> "6"
  | Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))) -> "7"
  | Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero))))))) -> "8"
  | Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))))) -> "9"
  | _ -> "bigger than 9"

let string_of_nat n =
  let base_10_rep = base10 n in
  let list_strings = List.map print_nat_digits base_10_rep in
  String.concat "" list_strings

string_of_nat converts the number of type nat to base 10, then maps each element of the list to a string and concatenates those strings.

So string_of_nat (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))))))))) returns "12" which is easier to read!

From string representation to type `nat`

We then define some code to go the other way around: from string representation to natural numbers.

let string_to_list s =
  let rec loop acc i =
    if i = -1 then acc
    else
      loop ((String.make 1 s.[i]) :: acc) (pred i)
  in loop [] (String.length s - 1)


let nat_of_listnat l =
  let lr = List.rev l in
  let rec aux n b lr =
    match lr with
    | [] -> n
    | h::t -> aux (n ++ (b+*h)) (b+*ten) t
  in
  aux Zero (Succ Zero) lr

let nat_of_string_digits = function
  | "0" -> Zero
  | "1" -> Succ Zero
  | "2" -> Succ (Succ Zero)
  | "3" -> Succ (Succ (Succ Zero))
  | "4" -> Succ (Succ (Succ (Succ Zero)))
  | "5" -> Succ (Succ (Succ (Succ (Succ Zero))))
  | "6" -> Succ (Succ (Succ (Succ (Succ (Succ Zero)))))
  | "7" -> Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero))))))
  | "8" -> Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero)))))))
  | "9" -> Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ (Succ Zero))))))))
  | _ -> raise (Failure "string must be less than 10")

(* Converts string to nat *)
let nat_of_string s =
  let liststring = string_to_list s in
  let listNatbase = List.map nat_of_string_digits liststring in
  nat_of_listnat listNatbase


(*
  final (infix) functions for adding, subtracting, multiplying, and dividing
  which take strings as arguments and return a string
*)
let (+++) n m =
 string_of_nat ((nat_of_string n) ++ (nat_of_string m))

let (---) n m =
 string_of_nat ((nat_of_string n) -- (nat_of_string m))

let (+**) n m =
 string_of_nat ((nat_of_string n) +* (nat_of_string m))

let (///) n m =
 string_of_nat ((nat_of_string n) // (nat_of_string m))

let (%%) n m =
 string_of_nat ((nat_of_string n) % (nat_of_string m))

So putting it all together, we have a working calculator for natural numbers!

Let’s try it out:

"3" +++ "17" returns "20"
"182" --- "93" returns "89"
"12" +** "3" returns "36"
"41" /// "3" returns "13"
"41" %% "3" returns "2"

Conclusion

We have built up natural numbers from first principles and now have a working calculator. However these operators start getting very slow for numbers of around 7 digits or more, so sticking with built-in integers sounds preferable.

All the code for this post is on Github

Thanks to James Jobanputra for useful feedback on this post

Testing MCMC code: the prior reproduction test

2020-02-04T08:00:00+00:00

Markov Chain Monte Carlo (MCMC) is a class of algorithms for sampling from probability distributions. These are very useful algorithms, but it’s easy to go wrong and obtain samples from the wrong probability distribution. What’s more, it won’t be obvious if the sampler fails, so we need ways to check whether it’s working correctly.

This post is mainly aimed at MCMC practitioners and describes a powerful MCMC test called the Prior Reproduction Test (PRT). I’ll go over the context of the test, then explain how it works (and give some code). I’ll then explain how to tune it and discuss some limitations.

Why should we test MCMC code ?

There are two main ways MCMC can fail: either the chain doesn’t mix or the sampler targets the wrong distribution. We say that a chain mixes if it explores the target distribution in its entirety without getting stuck or avoiding a certain subset of the space. To check that a chain mixes, we use diagnostics such as running the chain for a long time and examining the trace plots, calculating the \(\hat{R}\) (or potential scale reduction factor), and using the multistart heuristic. See the Handbook of MCMC for a good overview of these diagnostics. These help check that the chain converges to a distribution.

However the target distribution of the sampler may not be the correct one. This could be due to a bug in the code or an error in the maths (for example the Hastings correction in the Metropolis-Hastings algorithm could be wrong). To test the software, we can do tests such as unit tests which check that individual functions act like they should. We can also do integration tests (testing the entire software rather than just a component). One such test is to try to recover simulated values (as recommended by the Stan documentation): generate data given some “true” parameters (using your data model) and then fit the model using the sampler. The true parameter should then be within the credible interval (loosely, within 2 standard deviations of the posterior mean). This checks that the sampler can indeed recover the true parameter.

However this test is only a “sanity check” and doesn’t check whether samples are truly from the target distribution. What’s needed here is a goodness of fit (GoF) test. As doing a GoF test for arbitrarily complex posterior distributions is hard, the PRT reduces the problem to testing that some samples are from the prior rather than the posterior. I had trouble finding books or articles written about this (a similar version of this test is described by Cook, Gelman, and Rubin here, but they don’t call it PRT); if you know of any references let me know! [Update April 2021: Nianqiao (Phyllis) Ju has pointed out some references for this in the literature: Geweke (2004) describes the same method. Other similar approaches: Simulation Based Calibration (2020), and Testing MCMC Code (2014)].

I know of this test from my PhD supervisor Yvo Pokern who learnt it from another researcher during his postdoc. From talking to other researchers, it seems that this method has often been transmitted by word of mouth rather than from textbooks.

The Prior Reproduction Test

The prior reproduction test runs as follows: sample from the prior \(\theta_0 \sim \pi_0\), generate data using this prior sample \(X \sim p(X|\theta_0)\), and run the to-be-tested sampler long enough to get an independent sample from the posterior \(\theta_p \sim \pi(\theta|X)\). If the code is correct, the samples from the posterior should be distributed according to the prior. One can repeat this procedure to obtain many samples \(\theta_p\) and test whether they are distributed according to the prior.
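
To see why the posterior draws should reproduce the prior, here is a minimal sketch of the procedure in a conjugate Gaussian setting where the posterior can be sampled exactly. The exact posterior draw stands in for the MCMC sampler under test; the Gaussian prior N(0, 1), the function name, and the parameter values are assumptions made for this illustration.

```python
import random
import statistics

def prior_reproduction_draws(n_reps, n_data, sigma_data, rng):
    """PRT with an exact sampler: prior N(0, 1), data x_i ~ N(theta_0, sigma_data^2)."""
    draws = []
    for _ in range(n_reps):
        # sample from the prior
        theta_0 = rng.gauss(0.0, 1.0)
        # generate data using the prior sample
        data = [rng.gauss(theta_0, sigma_data) for _ in range(n_data)]
        # conjugate Gaussian posterior (this replaces the sampler being tested)
        post_precision = 1.0 + n_data / sigma_data**2
        post_mean = (sum(data) / sigma_data**2) / post_precision
        draws.append(rng.gauss(post_mean, post_precision ** -0.5))
    return draws

draws = prior_reproduction_draws(n_reps=5000, n_data=10, sigma_data=3.0,
                                 rng=random.Random(0))
# if everything is correct these should be close to the prior's mean 0 and variance 1
print(statistics.mean(draws), statistics.variance(draws))
```

A buggy posterior (for example a wrong posterior variance) would show up here as draws whose spread no longer matches the prior.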

Here is the test in Python (code available on Github). First we define the observation operator \(\mathcal{G}\) (the mapping from parameter to data, in this case simply the identity) along with the log-likelihood, log-prior, and log-posterior. So here our data is simply sampled from a Gaussian whose mean is the parameter \(\theta\) and whose standard deviation is 3.

import numpy as np

def G(theta):
	"""
	G(theta): observation operator. Here it's just the identity function, but it could
	be a more complicated model.
	"""
	return theta

# data noise:
sigma_data = 3

def build_log_likelihood(data_array):
	"Builds the log_likelihood function given some data"
	def log_likelihood(theta):
		"Data model: y = G(theta) + eps"
		# the outer parentheses let the expression continue onto the next line
		return (- (0.5)/(sigma_data**2)
			* np.sum([(elem - G(theta))**2 for elem in data_array]))
	return log_likelihood

def log_prior(theta):
	"uniform prior on [0, 10]"
	if not (0 < theta < 10):
		return -9999999
	else:
		return np.log(0.1)

def build_log_posterior(log_likelihood):
	"Builds the log_posterior function given a log_likelihood"
	def log_posterior(theta):
		return log_prior(theta) + log_likelihood(theta)
	return log_posterior

We want to test the code for a Metropolis sampler with Gaussian proposal (given in the MCMC module), so we run the PRT for it (the following code is in the run_PRT() function in PRT.py):

results = []
B = 200

for elem in range(B):
	# sample from prior
	sam_prior = np.random.uniform(0,10)

	# generate data points using the sampled prior
	data_array = G(sam_prior) + np.random.normal(loc=0, scale=sigma_data, size=10)

	# build the posterior function
	log_likelihood = build_log_likelihood(data_array=data_array)
	log_posterior = build_log_posterior(log_likelihood)

	# define the sampler
	ICs = {'theta': 1}
	sd_proposal = 20
	mcmc_sampler = MHSampler(log_post=log_posterior, ICs=ICs, verbose=0)
	# add a Gaussian proposal
	mcmc_sampler.move = GaussianMove(ICs, cov=sd_proposal)

	# Get a posterior sample.
	# Let the sampler run for 200 iterations to make sure it's independent from the initial condition
	mcmc_sampler.run(n_iter=200, print_rate=300)
	last_sample = mcmc_sampler.all_samples.iloc[-1].theta

	# store the results. Keep the posterior sample as well as the prior that generated the data
	results.append({'posterior': last_sample, 'prior': sam_prior})

We then check that the posterior samples are uniformly distributed (i.e. the same as the prior) (see figure 1). Here we do this by eye, but we could have done this more formally (for example using the Kolmogorov-Smirnov test).

Figure 1: Empirical CDF of the output of PRT: these seem to be uniformly distributed
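
As a sketch of the more formal check mentioned above, the one-sample Kolmogorov-Smirnov statistic against the Uniform(0, 10) prior can be computed in a few lines of plain Python. The function names are hypothetical, and the rejection threshold uses the large-sample 5% approximation of roughly 1.36 / sqrt(n), so treat it as a rough guide rather than an exact test.

```python
import math

def ks_statistic_uniform(samples, low=0.0, high=10.0):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(low, high)."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = min(max((x - low) / (high - low), 0.0), 1.0)
        # the empirical CDF jumps from (i-1)/n to i/n at x
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

def rejects_uniform(samples, low=0.0, high=10.0):
    """Approximate 5% level test: reject when D exceeds about 1.36 / sqrt(n)."""
    return ks_statistic_uniform(samples, low, high) > 1.36 / math.sqrt(len(samples))

evenly_spread = [0.5 + i for i in range(10)]   # 0.5, 1.5, ..., 9.5
stuck_at_zero = [0.0] * 10
print(ks_statistic_uniform(evenly_spread), rejects_uniform(evenly_spread))
print(ks_statistic_uniform(stuck_at_zero), rejects_uniform(stuck_at_zero))
```

To apply this to the PRT output, one would pass in the list of posterior samples collected in results.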

Tuning the PRT

Notice how we let the sampler run for 200 iterations to make sure that the posterior sample we get is independent of the initial condition (mcmc_sampler.run(n_iter=200, print_rate=300)). The number of iterations used needs to be tuned to the sampler; if it mixes slowly then you’ll need more iterations. This means that a slowly mixing sampler will cause the PRT to become more computationally expensive. We also needed to tune the proposal variance in the Gaussian proposal (called sd_proposal); ideally this will be a good tuning for any dataset generated in the PRT, but this may not always be the case. Sometimes the sampler needs hand tuning for each generated dataset; in this case it may also be too expensive to run the entire test. We’ll see later what other tests we can do in this case.

Finally, how do we choose the amount of data to generate (here we chose 10 data points)? Consider 2 extremes: if we choose too much data then the posterior will have a very low variance and will be centred around the true parameter. So almost any posterior sample we obtain will be close to the true parameter (which we sampled from the prior), and so the PRT will (trivially) produce samples from the prior. This doesn’t test the statistical properties of the sampler, but rather tests that the posterior is centred around the true parameter. In the other extreme case, if we have too little data the likelihood will have a weak effect on the posterior, which will then essentially be the prior. The MCMC sampler will then sample from a distribution that is very close to the prior, and again the PRT becomes weaker. We therefore need to choose somewhere in the middle.

To tune the amount of data to generate we can plot the posterior vs the prior samples from the PRT as we can see in figure 2 below. Ideally there is a nice amount of variation around the line y=x as in the middle plot (for N=10 data points). In the other two cases the PRT will trivially recover prior samples and not test the software properly.

Figure 2: We need to tune the amount of data to generate in PRT
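
A rough way to quantify the two extremes is to compare the posterior spread to the prior spread as the number of data points N grows. The sketch below assumes a Gaussian prior with the same variance as the Uniform(0, 10) prior (100/12), which is only an approximation to the model used in this post; the function name is hypothetical.

```python
import math

def posterior_to_prior_sd_ratio(n_data, sigma_data=3.0, prior_var=100.0 / 12.0):
    """Posterior sd divided by prior sd for a Gaussian prior and Gaussian likelihood."""
    # conjugate update: precisions add
    post_var = 1.0 / (1.0 / prior_var + n_data / sigma_data**2)
    return math.sqrt(post_var / prior_var)

for n_data in (1, 10, 1000):
    print(n_data, round(posterior_to_prior_sd_ratio(n_data), 3))
```

Under these assumptions the ratio is roughly 0.72 for N=1 (the posterior is close to the prior), 0.31 for N=10, and 0.03 for N=1000 (the posterior collapses onto the true parameter), which matches the qualitative picture in figure 2.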

Limitations and alternatives

In some cases however it’s not possible to run the PRT. The likelihood may be too computationally expensive; it might require solving numerically a differential equation for example. It’s also possible that the proposal distribution needs to be tuned for each dataset. In this case you have to tune the proposal manually at each iteration of the PRT.

A way to deal with these problems is to only test conditionals of the posterior (in the case of higher dimensional posteriors). For example if the posterior is \(\pi(\theta_1, \theta_2)\), then run the test on \(\pi(\theta_1 | \theta_2)\). In some cases this can solve the problem of needing to retune the proposal distribution for every dataset. This also helps with the problem of expensive likelihoods, as the dimension of the conditional posterior is lower than the original one. Fewer samples are then needed to run the test.

Another very simple alternative is to use the sampler to sample from the prior (so simply commenting out the likelihood function in the posterior). This completely bypasses the problem of expensive likelihoods and the need to retune the proposal at every step. This test checks that the MCMC proposal is correct (the Hastings correction for example), so is good for testing complicated proposals. However if the proposal needed to sample from the prior is qualitatively different from the proposal needed to sample from the posterior, then it’s not a useful test.
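
As an illustration of this simpler test, here is a self-contained sketch of a random-walk Metropolis sampler targeting only the Uniform(0, 10) prior. It is a stand-in written for this example, not the MHSampler class used earlier; the proposal standard deviation and number of iterations are assumed values.

```python
import math
import random

def log_prior(theta):
    """Uniform prior on (0, 10)."""
    return math.log(0.1) if 0.0 < theta < 10.0 else -math.inf

def metropolis(log_target, theta_init, sd_proposal, n_iter, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    samples = []
    theta, log_p = theta_init, log_target(theta_init)
    for _ in range(n_iter):
        theta_prop = theta + rng.gauss(0.0, sd_proposal)
        log_p_prop = log_target(theta_prop)
        # 1 - random() lies in (0, 1], so the log is always defined
        if math.log(1.0 - rng.random()) < log_p_prop - log_p:
            theta, log_p = theta_prop, log_p_prop
        samples.append(theta)
    return samples

samples = metropolis(log_prior, theta_init=1.0, sd_proposal=5.0,
                     n_iter=20000, rng=random.Random(0))
# the samples should stay inside (0, 10) and have a mean close to 5
print(min(samples), max(samples), sum(samples) / len(samples))
```

The samples can then be compared to the known prior (by eye or with a goodness of fit test), exactly as for the PRT output.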

As mentioned in the introduction, the PRT reduces to testing goodness of fit of prior samples, the idea being that this is easier to test as prior distributions are often chosen for their simplicity. One can of course test goodness of fit on the MCMC samples directly (without the PRT) using a method such as the Kernel Goodness-of-fit test. This avoids the problems discussed above, but it requires gradients of the log target density, whereas the PRT makes no assumptions about the target distribution.

Conclusions

The Prior Reproduction Test is a powerful way to test MCMC code but can be expensive computationally. This test - along with its simplified versions described above - can be included in an arsenal of diagnostics to check that MCMC samples are from the correct distribution.

Code to reproduce the figures is on Github

Thanks to Heiko Strathmann and Lea Goetz for useful feedback on this post

The DjangoVerse

2019-11-27T11:01:52+00:00

The DjangoVerse is a 3D graph of gypsy jazz players around the world. I designed this with Matt Holborn (he got the idea from the Rhizome) and built it using React and Django.

How does it work ?

As anyone can modify it, people can add themselves or players they know to it. If you click on a player you get information about them: what instrument they play, a picture of them, a short bio, and a link to a youtube video of them. As the names are coloured by country, you can immediately see how many players there are in the different countries around the world. You can try out the DjangoVerse in the figure below:

The DjangoVerse

The players have a link between them if they have gigged together, and if you click on a player you get those links highlighted in red. This allows you to see at a glance who they’ve played with and whether they’ve played with people from different countries. You can also filter the graph to only display players from chosen countries, based on the instruments they play, or whether or not they’re active. We started out by adding around 60 players ourselves, and then shared it on Facebook and Instagram; the gypsy jazz community added the rest (there are 220 players across 21 countries at the time of writing).

Tech stack

I built the graph with React and D3 force directed graph and hosted it on S3 (see code). The API is built using Django and Postgres and is hosted on Heroku (with S3 for static files). As the DjangoVerse is part of the London Django Collective, I used the same Django application to serve the pages for the Collective as well as the API. As the React app with the graph is hosted on S3, the page in the Collective website simply has an iframe that points to it.

The design process

A first attempt

The main motivation was that I’ve wanted for a long time to create a 3D graph mapping links between related things (and had ideas about doing this for academic disciplines, jazz standards, and more). So this project was a way to scratch that itch. The objective more specifically was to be able to visualise the gypsy jazz scene in one place, discover new players and bands, and let people be able to promote their music/bands.

As a result we started off with many different types of nodes: players, bands, festivals, albums, and venues. So each of these would be added to the graph along with links between them. A link between a player and a band would mean that the player is in the band, a link between a band and a festival would mean that it’s played at the festival, and so on. Each node would be a sphere of different size (the size would depend on the type) and the name would appear on hover; this was inspired by Steemverse (a visualisation of a social network).

Furthermore, the links between two nodes would also have information about it, such as the year a band has played in a festival, or the years a player was active in a band. You would then be able to filter the graph to only show what happened in a given year, which would give a “snapshot” of the gypsy jazz scene at that moment in time.

Too much stuff

However, it quickly became clear that it was too much information: having all these types of nodes and information about the links would be too overwhelming to have in the graph. So we removed the venue and album types, along with the information about each link. We kept only the active/inactive tags which would allow us to differentiate between the gypsy jazz scene in the past and in the present.

We then tested a prototype (with players, bands, and venues all represented as spheres of different sizes) with some friends (see the classic Don’t Make Me Think for an overview of user testing), and it turned out that it wasn’t very clear what the DjangoVerse was. For example one reaction was “I’m guessing it’s a simulation of a molecule or something”, which makes sense given that it essentially looked like this. This could maybe be fixed by adding names next to the nodes, but if you do this then D3 starts lagging quite quickly as you add many players.

Another problem was that festivals naturally ended up being at the centre of the graph, as they were the nodes with the most connections. The players and bands themselves then ended up seeming less important, even though we think a style of music is mainly about the players themselves rather than the festivals. As a visualisation is supposed to bring out the aspects of the data that the designer thinks are most important, we needed to have the players be more prominent.

Simplifying the design

A fix to both of these problems was to simplify the graph again: we removed festivals and albums and kept just the players. We also just showed the names of the players rather than the spheres. As the names are immediately visible, a user can then recognise some of the players and guess immediately what this is about (this was confirmed with testing). However a downside of this is that having all the names rather than just spheres causes the graph to lag when there are more than 100 or so players. Steemverse gets around this problem by only having names for the “category” types of nodes (which are rare); all other spheres only have names on hover.

As for users adding players, there is no authentication, so anyone can add or modify a player without needing to log in. The benefit is that there is less of a barrier for people to add to the graph, but the risk is that people could post spam (or delete all the players!). To mitigate this, I set up daily backups (easy to do with Heroku), which would allow me to restore the graph to its state before a problem occurred. If the problem persisted, I would simply have added authentication (for example OAuth with Google, Facebook, etc.).

Outcomes and comparison to other graphs

Players on the gypsy jazz scene around the world added lots of players to the graph: there are 220 players spanning 21 countries, with 9 instruments represented. A feature that was used a lot was the possibility of adding a YouTube video: this allows each player to showcase their music. The short bio for each player was also interesting; when we added the bio we didn’t think much of it, nor did we consider how it would be used. However, some of the users added information such as which players were related to each other (father, cousin, etc.), which was really interesting!

Lessons

In terms of design, an important take-away from graph visualisations such as this is how much information to include. Although a main aspect of these visualisations is just “eye-candy” (i.e. it looks fun), it would be good if they were also informative or insightful. At one end of the spectrum, if there is too little information then there is not much to learn from the visualisation. At the other extreme, if there is too much information (and the design isn’t done carefully) then it’s easy to get overwhelmed. For me, some examples of these problems are Wikiverse (it has a huge amount of information, as it’s a subset of Wikipedia, and I find the interface very confusing), Steemverse (it looks great, but there’s not much information in it) and the Rhizome (as it’s in only 2 dimensions, it’s hard to see what’s going on in the graph).

In contrast, an example of a simple graph that I think works well is this map of “theories of everything”. I don’t understand what these theories are (these are disciplines in theoretical physics), but the design is done very well and classifies them in a clear way.

Other examples of very well designed graphs are the ones built by concept.space, such as this map of philosophy. It has a huge amount of information, but most of it is hidden if you are zoomed out. As you zoom into a specific area of philosophy you get more and more detail about that area of philosophy until you have individual papers. When you click on a paper you then get the abstract and a link to it.

Notice also the minimap in the lower right-hand corner that reminds you of where you currently are in the map. Finally, it seems that they have automated the process of adding and clustering the papers (judging from the software credited on their website). They seem to have scraped PhilPapers, used Word2Vec to get word embeddings for each paper, reduced the dimension of the space, and finally clustered the result to find the location of each paper in the two-dimensional map. As a result, they could then use this workflow to create similar maps for climate science and biomedicine.

In conclusion, the idea of a visual map showing the links between different things in a discipline (players in gypsy jazz, papers in philosophy, etc.) is a very appealing one. However, getting it right is surprisingly difficult; for me the best example is the map of philosophy described above.

Thanks to Lukas DeRungs for reading a draft of this post.
