Implementing Bayesian Program Synthesis in OCaml
This post aims to reproduce some results from Saad et al.'s paper on Bayesian synthesis1 in OCaml, an industrial-strength functional programming language. A complete notebook for the code in this post can be found here; the data used for the Bayesian-style time-series analysis can be found here.
Environment Setup
The code in this post was tested under OCaml 4.14.0 on macOS 12.6.1 with an M2 chip; it should also work on macOS (or Linux) with Intel/AMD chips. The environment is assumed to contain a Python distribution with jupyter and matplotlib installed.
I am maintaining a ppl repo that contains some extensions of existing OCaml packages. We will use libtorch in this post, and we need to pin the ocaml-torch package to version 0.14: the only M1/M2-compatible libtorch library currently provided by my ppl repo is compatible only with this version of ocaml-torch.
The torch_ext.0.14 and matplotlib.20221112 packages are provided by my ppl repo: the former adds functionality for probability distributions on tensors, and the latter supports more plotting controls and functions. We will use owl for parsing CSV files; M1/M2 users may need to refer to this comment to install owl.
opam remote add ppl https://github.com/stonebuddha/ppl-opam.git
opam pin https://github.com/LaurentMazare/ocaml-torch.git#0.14 --with-version=0.14
opam install core jupyter
opam install torch_ext.0.14 matplotlib.20221112
opam install owl
The following command registers OCaml as a jupyter kernel (you might take a look at the installation log of the OCaml jupyter package to find out the exact command):
jupyter kernelspec install --name ocaml-jupyter /PATH/TO/YOUR/OCAML/SWITCH/share/jupyter
For Anaconda-based Python distributions, or for Python distributions that are not linked against the standard runtime-library search paths, you might need to update the kernels/ocaml-jupyter/kernel.json file located in your jupyter installation. For example, I need to set the following environment variable before invoking ocaml-jupyter-kernel:
DYLD_LIBRARY_PATH=/opt/homebrew/anaconda3/lib:$DYLD_LIBRARY_PATH
Finally, we need to add the following code to the .ocamlinit file (which should be located in your home folder):
#use "topfind";;
Topfind.log := ignore;;
Now you should be able to create and run an OCaml notebook in jupyter.
OCaml Torch Basics
Here is the link to the official introduction of the ocaml-torch package. Most of its APIs are directly generated from libtorch, and thus are similar to the PyTorch APIs, but in a more functional style.

Tensors have type Tensor.t and scalars have type Scalar.t. Because OCaml is statically and strongly typed, we cannot directly multiply a tensor by a floating-point number; instead, we first embed the floating-point number as a scalar via Scalar.f : float -> Scalar.t and then perform the tensor-scalar multiplication via Tensor.mul_scalar : Tensor.t -> Scalar.t -> Tensor.t.
One benefit of this programming style is that it removes many runtime checks, because the type information is known at compile time. However, this style is not as succinct as Python's. Even worse, the lack of ad-hoc polymorphism (e.g., operator overloading) makes programming in OCaml less convenient than in Haskell in this setting. One workaround is OCaml's local-open syntax, which brings a module's functions (operators are also functions) into scope for a single expression: if t1 : Tensor.t and t2 : Tensor.t are two tensors of the same shape, we can write Tensor.(t1 + t2) for their pointwise addition.
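To make this concrete, here is a minimal sketch of tensor arithmetic in this style. It uses only the Scalar.f, Tensor.mul_scalar, and local-open idioms quoted above, plus the standard ocaml-torch helpers Tensor.ones and Tensor.print:

```ocaml
open Torch

let () =
  let t1 = Tensor.ones [ 2; 3 ] in
  (* Embed the float 2.5 as a Scalar.t, then scale the tensor by it. *)
  let t2 = Tensor.mul_scalar t1 (Scalar.f 2.5) in
  (* The local open Tensor.( ... ) brings the (+) operator on tensors into scope. *)
  let t3 = Tensor.(t1 + t2) in
  Tensor.print t3 (* a 2x3 tensor filled with 3.5 *)
```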
Gaussian Process Regression
Readers can refer to the textbook Gaussian Processes for Machine Learning for more details on Gaussian Processes. In this post, we focus on a Bayesian-style time-series analysis based on Gaussian Process Regression. Let $\mathbb{T}$ be an index set and $X \triangleq \{ X(t) \mid t \in \mathbb{T} \}$ be a collection of real-valued random variables. We say $X$ is a Gaussian Process if for any vector $\mathbf{t} = [t_1, t_2, \cdots, t_n]$ of distinct indexes, the random vector $X(\mathbf{t}) \triangleq [X(t_1), X(t_2), \cdots, X(t_n)]$ has a joint Gaussian distribution.
We can represent a Gaussian Process by its mean function $m : \mathbb{T} \to \mathbb{R}$ and covariance function $k : \mathbb{T} \times \mathbb{T} \to \mathbb{R}$, which satisfy, for all $t,t' \in \mathbb{T}$, $m(t) = \mathbb{E}[X(t)]$ and $k(t,t') = \mathbb{E}[ (X(t) - m(t)) (X(t') - m(t'))]$. Let $m(\mathbf{t})$ denote the mean vector $[m(t_1), m(t_2), \cdots, m(t_n)]$, and let $k(\mathbf{t}, \mathbf{t})$ denote the covariance matrix whose $(i,j)$ entry is $k(t_i, t_j)$. We denote by $X \sim \mathrm{GP}(m,k)$ a Gaussian Process $X$ with mean $m$ and covariance $k$. For one-dimensional continuous time series (which we consider in this post), we can assume that $\mathbb{T} = \mathbb{R}$.
The joint probability density of $X(\mathbf{t})$ at a real vector $x(\mathbf{t}) \triangleq [x(t_1), x(t_2), \cdots, x(t_n)]$ satisfies $$ \begin{align} \log p(x(\mathbf{t})) & = -\frac{1}{2} \biggl[ (x(\mathbf{t}) - m(\mathbf{t}))^\mathsf{T} k(\mathbf{t}, \mathbf{t})^{-1} ( x(\mathbf{t}) - m(\mathbf{t}) ) \\ & \qquad\qquad {} + \log (\det k(\mathbf{t}, \mathbf{t})) \\ & \qquad\qquad {} + n \log (2\pi) \biggr] . \end{align} $$ Multivariate Gaussian distributions are closed under conditioning; assuming a zero mean function $m \equiv 0$ (which suffices for the models synthesized below), the posterior distribution of $X(\mathbf{t}')$ at new time indexes $\mathbf{t}'$ is also a multivariate Gaussian: $$ \begin{align} X(\mathbf{t}') \mid X(\mathbf{t}) = x(\mathbf{t}) & \sim \mathrm{MultivariateGaussian}(m_{post}(\mathbf{t}'), k_{post}(\mathbf{t}', \mathbf{t}')), \\ m_{post}(\mathbf{t}') & \triangleq k(\mathbf{t}', \mathbf{t}) k(\mathbf{t}, \mathbf{t})^{-1} x(\mathbf{t}), \\ k_{post}(\mathbf{t}', \mathbf{t}') & \triangleq k(\mathbf{t}', \mathbf{t}') - k(\mathbf{t}', \mathbf{t}) k(\mathbf{t}, \mathbf{t})^{-1} k(\mathbf{t}, \mathbf{t}'). \end{align} $$
One can easily add i.i.d. noise to a Gaussian Process. Consider $Z \sim \mathrm{GP}(m, k)$ and $X(t) \triangleq Z(t) + \gamma(t)$, where $\gamma(t) \sim \mathrm{Gaussian}(0, \varepsilon)$ independently for each $t \in \mathbb{T}$. We then say $X$ is a Gaussian Process with output noise $\varepsilon$, written $X \sim \mathrm{NoisyGP}(m,k,\varepsilon)$. One can show that $X$ itself is indeed a Gaussian Process $X \sim \mathrm{GP}(m,k')$, where $k'$ is defined as $k'(t,t') \triangleq k(t,t') + [t=t'] \cdot \varepsilon$ for all $t,t' \in \mathbb{T}$.
See the compute_log_likelihood and get_conditional_mu_cov functions in the notebook for computing log-likelihoods and posterior parameters.
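For readers without the notebook at hand, here is a minimal sketch of what these two computations might look like on ocaml-torch tensors; the notebook's actual implementations may differ in naming and shape conventions. I assume column vectors of shape [n; 1] throughout, and I assume the generated bindings Tensor.inverse and Tensor.logdet are available (they correspond to the eponymous ATen functions):

```ocaml
open Torch

(* Log density of MultivariateGaussian(mu, cov) at xs, following the
   formula above; xs and mu have shape [n; 1], cov has shape [n; n]. *)
let compute_log_likelihood ~mu ~cov xs =
  let n = Float.of_int (List.hd (Tensor.shape xs)) in
  let diff = Tensor.(xs - mu) in
  let quad =
    Tensor.(mm (mm (transpose diff ~dim0:0 ~dim1:1) (inverse cov)) diff)
    |> Tensor.to_float0_exn
  in
  let logdet = Tensor.to_float0_exn (Tensor.logdet cov) in
  -0.5 *. (quad +. logdet +. (n *. Float.log (2. *. Float.pi)))

(* Posterior mean and covariance at new indexes, assuming a zero mean
   function; the arguments are the blocks k(t',t), k(t,t), k(t',t'), k(t,t'). *)
let get_conditional_mu_cov ~k_new_old ~k_old_old ~k_new_new ~k_old_new xs =
  let gain = Tensor.mm k_new_old (Tensor.inverse k_old_old) in
  let mu_post = Tensor.mm gain xs in
  let cov_post = Tensor.(k_new_new - mm gain k_old_new) in
  (mu_post, cov_post)
```

(A production implementation would solve the linear systems via a Cholesky factorization instead of forming the inverse explicitly, but the direct translation keeps the correspondence with the formulas visible.)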
GP Synthesis via MCMC
There have been many methods for defining the covariance function of a Gaussian Process. The grammar below shows one method using a domain-specific language (DSL), thus enabling an expressive set of (possibly very complex) covariance functions: $$ \begin{align} E & ::= \texttt{NoisyGP}(0, K, \varepsilon) \\ K & ::= \texttt{Constant}(\varphi) \mid \texttt{Linear}(\theta) \mid \texttt{Squared\_exponential}(\varphi) \\ & \mid \texttt{Periodic}(\varphi_1, \varphi_2) \mid K_1 + K_2 \mid K_1 * K_2 \\ & \varepsilon \in \mathbb{R}_{> 0} \qquad \theta \in \mathbb{R} \qquad \varphi \in \mathbb{R}_{> 0} \end{align} $$
- $\texttt{Constant}(\varphi)$: the parameter $\varphi$ is the variance of a constant process around $0$.
- $\texttt{Linear}(\theta)$: the parameter $\theta$ is the time intercept of a linear process, i.e., $X(\theta) = 0$ with probability one.
- $\texttt{Squared\_exponential}(\varphi)$: the parameter $\varphi$ is the length scale of a stationary smooth process.
- $\texttt{Periodic}(\varphi_1, \varphi_2)$: the parameter $\varphi_1$ is the length scale of a periodic process, and the parameter $\varphi_2$ is the frequency of the process.
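This grammar maps naturally onto an OCaml variant type. The following is a hypothetical rendering; the notebook's actual constructor names may differ:

```ocaml
(* Covariance expressions of the DSL; the positivity constraints on the
   parameters are enforced by the prior, not by the type. *)
type cov =
  | Constant of float                (* variance (phi > 0) *)
  | Linear of float                  (* time intercept (theta) *)
  | Squared_exponential of float     (* length scale (phi > 0) *)
  | Periodic of float * float        (* length scale and frequency *)
  | Sum of cov * cov                 (* K1 + K2 *)
  | Product of cov * cov             (* K1 * K2 *)
```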
In a Bayesian-style time-series analysis, we model unknown quantities as random variables and known data as generated observations. For any $n > 0$ and distinct time indexes $\mathbf{t} = [t_1, t_2, \cdots, t_n]$, the generative model is
- Sample a noise level $\varepsilon \sim P(\varepsilon)$ (e.g., from a Gamma prior).
- Sample a covariance expression $K \sim P(K)$ (e.g., via the covariance_prior function in the notebook, which implements a probabilistic context-free grammar).
- Sample time-series data $X(\mathbf{t}) \sim \mathrm{NoisyGP}(0, K, \varepsilon)$. (See the eval_cov_mat function in the notebook for how to translate a DSL expression into a covariance function; a sketch of such a translation follows this list.)
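As a hedged sketch of what that translation could look like, here is a pointwise interpreter over the cov type sketched above, using textbook forms of the four base kernels; the notebook's eval_cov_mat instead builds the full covariance matrix as a tensor, and its exact parameterization of each kernel may differ. The output noise $\varepsilon$ of $\mathrm{NoisyGP}$ would then be added to the diagonal of the resulting matrix.

```ocaml
(* Interpret a covariance expression as a kernel on time indexes. *)
let rec eval_cov (k : cov) (t : float) (t' : float) : float =
  match k with
  | Constant phi -> phi
  | Linear theta -> (t -. theta) *. (t' -. theta)
  | Squared_exponential phi ->
    (* exp(-(t - t')^2 / (2 * phi^2)) *)
    Float.exp (-.(((t -. t') ** 2.) /. (2. *. phi *. phi)))
  | Periodic (phi, freq) ->
    (* exp(-2 * sin^2(pi * freq * (t - t')) / phi^2) *)
    Float.exp
      (-.(2. *. (Float.sin (Float.pi *. freq *. (t -. t')) ** 2.)
          /. (phi *. phi)))
  | Sum (k1, k2) -> eval_cov k1 t t' +. eval_cov k2 t t'
  | Product (k1, k2) -> eval_cov k1 t t' *. eval_cov k2 t t'
```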
Running Bayesian inference on the generative model above, conditioned on the observed vector $x(\mathbf{t}) = [x(t_1), x(t_2), \cdots, x(t_n)]$, essentially performs a Bayesian synthesis of the covariance expression $K$.
Markov-chain Monte Carlo (MCMC) is a popular sampling-based method for Bayesian inference; it allows us to sample from the posterior distribution $P(\varepsilon, K \mid X(\mathbf{t}) = x(\mathbf{t}))$. MCMC works by constructing a Markov chain over the latent random variables ($\varepsilon$ and $K$ in this case). The chain is usually driven by a proposal kernel that generates new candidates for the latent variables from their current values, and each proposal is accepted or rejected according to the Metropolis-Hastings criterion.
The mh_resample_noise function in the notebook simply generates a fresh noise level from a Gamma distribution. The mh_resample_subtree_unbiased function implements a more involved mechanism: it selects a node of the AST of the covariance expression $K$ uniformly at random and replaces the subtree rooted at that node with a freshly generated sub-expression. The run_mcmc function then runs one epoch of the MCMC method by repeating these resampling moves a number of times.
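For intuition, here is a minimal, hypothetical sketch of a single Metropolis-Hastings step of this kind; log_joint and propose are assumed helpers (not from the notebook) that score a candidate under the generative model and generate a candidate together with its log proposal ratio, respectively.

```ocaml
(* One Metropolis-Hastings step: propose a candidate, then accept it
   with probability min(1, exp(log_alpha)). *)
let mh_step ~log_joint ~propose current =
  let candidate, log_proposal_ratio = propose current in
  let log_alpha =
    log_joint candidate -. log_joint current +. log_proposal_ratio
  in
  if Float.log (Random.float 1.) < log_alpha then candidate else current
```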
The last cell of the notebook demonstrates the effectiveness of the Bayesian-synthesis approach on a time-series analysis of revenue passenger miles for US air carrier domestic flights.
Next Steps
- How to write tensor-manipulation code more easily in a statically and strongly typed language like OCaml?
- How to develop a lightweight mechanism that checks tensor shape compatibility at compile time?
- The plotting functionality provided in this post largely depends on the Python ecosystem. How to design a plotting library with APIs that are more suitable for typed functional programming?
- Feras A. Saad, Marco F. Cusumano-Towner, Ulrich Schaechtle, Martin C. Rinard, and Vikash K. Mansinghka. 2019. Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling. POPL 2019.