Literature notes
One day I decided to put all my notes in one place, let's see how it turns out. These might well be incomplete or erroneous, and do the authors no favors. Don't blame anyone for the stuff written below.
\(\def\textsc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}}\)
Table of Contents
- 1. Estimation and Model Identification for Continuous Spatial Processes
- 2. Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes
- 3. Vecchia Approximations of Gaussian-Process Predictions
- 4. Computer experiments with functional inputs and scalar outputs
- 5. Wavelets and their Applications
- 6. Log Gaussian Cox processes on the sphere
1. Estimation and Model Identification for Continuous Spatial Processes
A. V. Vecchia 1988 https://doi.org/10.1111/j.2517-6161.1988.tb01729.x
1.1. Introduction
- An empirical variogram is basically a method of moments estimate of the true variogram, so nicely put :)
- \(n\) observations, \(m\) neighbors s.t. \(m \ll n\)
- Approximate likelihood functions \(L_m \to L\) as \(m \to n\)
- Iterative estimation whereby estimates based on \(L_m\) are used as initial values for estimates based on \(L_{m + 1}\)
1.2. Estimation
1.2.1. Approximate likelihood function
Let \(\mathbf{z} = \xi(\mathbf{x}) + \mathbf{\eta}\), where \(\xi\) is a spatial component and \(\mathbf{\eta}\) an iid error [1]. The exact likelihood is
\begin{align} L(\mathbf{z}) &= {\left(2\pi\right)}^{-n/2} {\left(\sigma^2\gamma\right)}^{-n/2} {\left|\mathbf{R} + \nu^2 \mathbf{I} \right|}^{-1/2} \exp\left\{ {-(2\sigma^2\gamma)}^{-1} \mathbf{z}^\top {(\mathbf{R} + \nu^2 \mathbf{I})}^{-1} \mathbf{z} \right\} \\ \gamma &= \text{var}\left(\xi\right) / \sigma^2 \\ \mathbf{R} &= \text{cor}\left(\xi\right) \\ \nu^2 &= \sigma^2_\eta / \text{var}\left(\xi\right) \end{align}In terms of conditional normal probability functions \(p(\cdot | \cdot)\),
\begin{align} L(\mathbf{z}) &= \prod_{i=1}^n p(z_i | z_1, \dots, z_{i - 1}) \\ &\approx \prod_{i=1}^n p(z_i | \mathbf{z}_{im}) \end{align}where \(\mathbf{z}_{im}\) contains \(m\) neighbors of the \(i\)-th observation chosen among the preceding ones [2]. Select the observations to include in \(\mathbf{z}_{im}\) according to some computationally thrifty criterion; if the criterion depends on the model form and parameter values (e.g., the length-scale), it can cause instabilities in the iterative parameter estimation procedure. For example, select the \(m\) observations with the smallest Euclidean distance in the inputs. The exact likelihood is invariant to permutations of the observations but the approximate one is not, especially for small \(m\). Systematic ordering rules include lexicographic ordering of the input values (e.g., coordinates), or subsetting from observations within a fixed distance (e.g., for lattice data).
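A minimal numpy sketch of the approximation under stated assumptions: a zero-mean process, an exponential correlation plus a nugget, and neighbors chosen as the \(m\) nearest preceding observations. Function and parameter names are mine, not the paper's.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, length=0.3):
    """Exponential covariance between two sets of locations (no nugget)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / length)

def vecchia_loglik(z, X, m, sigma2=1.0, length=0.3, tau2=0.1):
    """Approximate log-likelihood: sum of log p(z_i | m nearest preceding obs).

    Locations are assumed already ordered (e.g., lexicographically); tau2 is the
    iid error ("nugget") variance.  Cost is O(n m^3) instead of O(n^3).
    """
    n = len(z)
    ll = 0.0
    for i in range(n):
        nb = np.argsort(np.linalg.norm(X[:i] - X[i], axis=1))[:m] if i > 0 else np.array([], int)
        C_nn = exp_cov(X[nb], X[nb], sigma2, length) + tau2 * np.eye(len(nb))
        c_in = exp_cov(X[[i]], X[nb], sigma2, length).ravel()
        w = np.linalg.solve(C_nn, c_in) if len(nb) else np.zeros(0)
        mu = w @ z[nb]                         # conditional mean given the neighbors
        var = sigma2 + tau2 - w @ c_in         # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (z[i] - mu) ** 2 / var)
    return ll
```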
1.2.2. Iterative maximum likelihood estimation
The \(m\)-order approximate maximum likelihood estimates \(\hat{\mathbf{\psi}}_m\) are obtained by numerically maximizing the approximate likelihood function \(L_m\) for fixed \(m \ll n\). Start at \(m=1\) with some initial values \(\mathbf{\psi}_0\), then iteratively [3] estimate \(\hat{\mathbf{\psi}}_{m + 1}\) by numerically maximizing \(L_{m + 1}\) with starting values \(\hat{\mathbf{\psi}}_{m}\). Increase \(m\) until convergence is achieved, based on some statistic and an arbitrarily small change, e.g., \(\Lambda_m = -2 \log L_m\left(\hat{\mathbf{\psi}}_{m}\right)\), which approaches \(-2 \log \hat{L}_n\), minus twice the maximized exact log-likelihood, as \(m \to n\). The application section monitors Akaike's information criterion \(A_m = \Lambda_m + 2 |\mathbf{\psi}|\).
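A sketch of the iterative scheme built on the `vecchia_loglik` sketch above; `scipy.optimize.minimize` with Nelder-Mead stands in for whatever optimizer one prefers, and the parameterization and tolerance are arbitrary choices of mine.

```python
import numpy as np
from scipy.optimize import minimize

def fit_vecchia(z, X, m_max=10, tol=1e-3, psi0=(0.0, -1.0, -2.0)):
    """Maximize L_m for increasing m, warm-starting each fit at the previous optimum.

    psi holds log(sigma2), log(length), log(tau2); stop when Lambda_m = -2 log L_m
    changes by less than tol.
    """
    psi, prev = np.asarray(psi0, float), np.inf
    for m in range(1, m_max + 1):
        obj = lambda p: -vecchia_loglik(z, X, m, *np.exp(p))
        res = minimize(obj, psi, method="Nelder-Mead")
        psi, lam = res.x, 2.0 * res.fun          # Lambda_m = -2 log L_m(psi_hat_m)
        if abs(prev - lam) < tol:
            break
        prev = lam
    return np.exp(psi), lam
```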
2. Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes
Andrew O. Finley, Abhirup Datta, Bruce D. Cook, Douglas C. Morton, Hans E. Andersen & Sudipto Banerjee 2019 https://doi.org/10.1080/10618600.2018.1537924
2.1. Introduction
- exploit high-performance computing libraries to obviate expensive numerical linear algebra (e.g., expensive matrix multiplications and factorizations)
- outline three alternative formulations that are significantly more efficient for practical implementation than Datta et al. (2016a)
- application to 5 million locations
2.2. Nearest Neighbor Gaussian Processes
Let \(y(\mathbf{s}_i)\) and \(\mathbf{x}(\mathbf{s}_i)\) denote the response and the predictors at location \(\mathbf{s}_i\) for \(i = 1, \dots, n\), \(y(\mathbf{s}_i) = {\mathbf{x}(\mathbf{s}_i)}^\top \mathbf{\beta} + w(\mathbf{s}_i) + \varepsilon(\mathbf{s}_i)\) with iid noise \(\varepsilon(\mathbf{s}_i)\), a spatial random effect \(w(\mathbf{s}_i) \sim \text{GP}(0, C(\cdot | \mathbf{\theta}))\), and posterior
\begin{align} p(\mathbf{\beta}, \mathbf{\theta}, \tau^2) \times \textsc{N}(\mathbf{w} | \mathbf{0}, \mathbf{C}(\mathbf{\theta})) \times \textsc{N}(\mathbf{y} | \mathbf{X}\mathbf{\beta} + \mathbf{w}, \tau^2 \mathbf{I}) \end{align}for some prior \(p(\mathbf{\beta}, \mathbf{\theta}, \tau^2)\). Limitations: (i) \(\mathbf{C}\) requires \(O(n^2)\) dynamic memory, (ii) \(\textsc{N}(\mathbf{w} | \mathbf{0}, \mathbf{C})\) involves factorizations (e.g., Cholesky) that require \(O(n^3)\) flops, (iii) predicting at \(K\) new locations requires \(O(Kn^2)\) flops.
Note that \(\textsc{N}(\mathbf{w} | \mathbf{0}, \mathbf{C})\) can be rewritten as \(\mathbf{w} = \mathbf{A}\mathbf{w} + \mathbf{\eta}\), where \(\mathbf{A}\) is \(n \times n\) strictly lower-triangular with elements \(a_{ij} = 0\) whenever \(j \ge i\), \(\mathbf{\eta} \sim \textsc{N}(\mathbf{0}, \mathbf{D})\), and \(\mathbf{D}\) is diagonal with entries \(d_{11} = \text{var}(w_1)\) and \(d_{ii} = \text{var}(w_i | \{w_j : j < i\})\). Then, \(\mathbf{C} = {(\mathbf{I} - \mathbf{A})}^{-1} \mathbf{D}{(\mathbf{I} - \mathbf{A})}^{-\top}\) with \(D_{1,1} = C_{1,1}\) and \(\mathbf{A}[1, \cdot] = \mathbf{0}\) (null top row).
Setting some elements in the lower triangular part of \(\mathbf{A}\) to zero reduces inversion complexity from \(O(n^3)\) to \(O(n m^3)\), where \(m\) is the number of nonzero elements in each row of \(\mathbf{A}\). With \(\tilde{\mathbf{C}} = {(\mathbf{I} - \mathbf{A})}^{-1} \mathbf{D}{(\mathbf{I} - \mathbf{A})}^{-\top}\), the precision \(\tilde{\mathbf{C}}^{-1} = {(\mathbf{I} - \mathbf{A})}^{\top} \mathbf{D}^{-1}{(\mathbf{I} - \mathbf{A})}\) becomes sparse, reducing the computation of quadratic forms \(\mathbf{u}^{\top} \tilde{\mathbf{C}}^{-1} \mathbf{v}\) from \(O(n^2)\) to \(O(nm)\). Also, \(\text{det}(\tilde{\mathbf{C}}) = \prod_{i=1}^n d_{ii}\). Note that while \(\tilde{\mathbf{C}}\) need not be sparse, \(\textsc{N}(\mathbf{w} | \mathbf{0}, \tilde{\mathbf{C}})\) is evaluated in linear time \(O(n)\).
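A small numpy sketch of how the nonzero rows of \(\mathbf{A}\) and the diagonal of \(\mathbf{D}\) would be filled from a generic covariance function and given neighbor sets, and how the density is then evaluated in linear time. The names, the `cov` callable, and the neighbor-selection rule are placeholders, not the paper's code.

```python
import numpy as np

def nngp_factors(S, neighbors, cov):
    """Return the sparse rows of A and the diagonal of D for w ~ N(0, C~).

    S: (n, d) locations; neighbors[i]: indices of the (at most m) neighbors of i
    among 1..i-1; cov(A, B): covariance matrix between two location sets.
    """
    n = len(S)
    A_rows, d = [], np.empty(n)
    for i in range(n):
        nb = neighbors[i]
        if len(nb) == 0:
            A_rows.append(np.zeros(0))
            d[i] = cov(S[[i]], S[[i]])[0, 0]
            continue
        C_nn = cov(S[nb], S[nb])
        c_in = cov(S[[i]], S[nb]).ravel()
        a = np.linalg.solve(C_nn, c_in)               # nonzero entries of row i of A
        A_rows.append(a)
        d[i] = cov(S[[i]], S[[i]])[0, 0] - a @ c_in   # conditional variance
    return A_rows, d

def nngp_logpdf(w, A_rows, d, neighbors):
    """log N(w | 0, C~), evaluated in linear time once the factors are available."""
    resid = np.array([w[i] - A_rows[i] @ w[neighbors[i]] for i in range(len(w))])
    return -0.5 * (np.sum(np.log(2 * np.pi * d)) + np.sum(resid**2 / d))
```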
The MCMC implementation in Datta et al. (2016a), which requires updating the \(n\) latent spatial effects \(\mathbf{w}\) sequentially, often converges slowly. Three alternative formulations avoid this by marginalizing out the vector of spatial random effects.
2.2.1. Collapsed NNGP
Consider \(\textsc{N}(\mathbf{y} | \mathbf{X}\mathbf{\beta} + \mathbf{w}, \tau^2 \mathbf{I}) \times \textsc{N}(\mathbf{w} | \mathbf{0}, \tilde{\mathbf{C}})\), then \(\mathbf{y} \sim \textsc{N}(\mathbf{X}\mathbf{\beta}, \mathbf{\Lambda})\) where \(\mathbf{\Lambda} = \tilde{\mathbf{C}} + \tau^2 \mathbf{I}\). This model has \(p + 4 < n + p + 4\) parameters, where \(p = |\mathbf{\beta}|\). The exact computational cost, which depends on the data design via the location of the nonzero entries in the Cholesky factor, can be more than linear. See Algorithm 1 in the paper for a Gibbs sampler with Metropolis-Hastings updates, and Algorithm 2 for inference on the spatial effects and prediction at new locations.
2.2.2. NNGP for the response
Apply the nearest neighbor approximation directly to the marginal likelihood of \(\mathbf{y}\), i.e., \(y(\mathbf{s}) \sim \textsc{N}({\mathbf{x}(\mathbf{s})}^{\top}\mathbf{\beta}, \mathbf{\Sigma})\), replacing \(\mathbf{\Sigma} = \mathbf{C} + \tau^2 \mathbf{I}\) with its nearest-neighbor approximation \(\tilde{\mathbf{\Sigma}}\). Cannot recover \(\mathbf{w}\); use when the interest is in prediction. A computationally parsimonious solution for fully Bayesian analysis of massive spatial datasets. See Algorithms 3 and 4 for MCMC and posterior predictive inference on the response.
2.2.3. Conjugate NNGP for MCMC-free exact Bayesian inference
A hybrid cross-validation approach used when inference on covariance parameters is of little interest. See Algorithm 5.
2.3. Implementation
C++ with openBLAS and LAPACK for matrix algebra, and openMP for parallelization (an omp for loop over the precision matrix components, and an omp for with a reduction clause to compute the quadratic form in Pseudocode 3). Use an efficient search algorithm for fast nearest neighbors rather than brute-forcing.
- Experiment #1 (model run time)
- permutation matrix has large impact on efficiency
- response is dramatically faster than collapsed
- marginal to negligible improvement beyond 6 to 12 CPUs; overhead if more
- point of diminishing returns vary with \(n\)
- run time: 13, 13, 95 seconds per iteration with \(n = 10^7\) (ten million!) for the response, conjugate, and collapsed models
- Experiment #2 (prediction)
- all methods yield similar statistics
- TIU dataset
- 5 million data points
- \(m = 15\) and \(m = 25\) produce indistinguishable results
- similar, if not identical, prediction statistics across all methods
- run time: 318, 38, 2E-3 hours for the collapsed (25,000 samples), response, and conjugate NNGP
2.4. Summary
- collapsed is the only fully Bayesian alternative for inference on the spatial random effects
- conjugate model's uncompromised predictive inference requires the specification of some model parameters
- the spNNGP R package implements the response and conjugate NNGP
- for nonstationary processes, dynamic neighbor-finding algorithms may offer a better starting point
3. Vecchia Approximations of Gaussian-Process Predictions
Matthias Katzfuss, Joseph Guinness, Wenlong Gong & Daniel Zilber 2020 https://doi.org/10.1007/s13253-020-00401-7
3.1. Introduction
- GP approximations
- sparse covariance matrix, sparse precision matrix, composite likelihood, low-rank structure.
- Low-rank approaches are poorly suited for fine-scale dependence
- Sparse methods cannot guarantee linear time
- Local GP approximations do not scale well to joint predictions at many locations
- Vecchia approximation
- Training: a sparse Cholesky factor of the precision matrix, obtained by removing conditioning variables in the factorization of the observations' joint density into a product of conditional distributions
- Prediction: Vecchia's one-at-a-time conditioning has quadratic time complexity; Finley et al.'s one-for-one sampling draws each output from its posterior predictive distribution using a covariance matrix constructed only with neighbors; other options are conditional expectation with uncertainty quantification via conditional simulation, and domain-partitioning approximations
- This paper approach guarantees sparsity of the matrices necessary for inference, resulting in linear memory and time complexity in the number of data points and predictions for fixed conditioning-set size
3.2. General Vecchia Framework
- For a vector \(\mathbf{x}\) with precision matrix \(\mathbf{Q}\), \(\mathbf{U} = \textsc{rchol}\left(\mathbf{Q}\right)\) is the upper-triangular (reverse-ordered) Cholesky factor of \(\mathbf{Q}\) s.t. \(\mathbf{Q} = \mathbf{U}\mathbf{U}^{\top}\). The nonzero entries of \(\mathbf{U}\) can be computed directly, i.e., the matrix \(\mathbf{Q}\) need not be constructed (see the sketch below).
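A tiny numpy check of what such a factor looks like, assuming \(\textsc{rchol}\) denotes the upper-triangular factor with \(\mathbf{Q} = \mathbf{U}\mathbf{U}^{\top}\): reverse the ordering, take an ordinary lower Cholesky factor, and reverse back. Built densely here only to illustrate the identity; the point of the paper is that the nonzero entries can be computed directly.

```python
import numpy as np

def rchol(Q):
    """Upper-triangular U with Q = U @ U.T (reverse-ordered Cholesky)."""
    J = np.eye(len(Q))[::-1]           # exchange (order-reversing) matrix
    L = np.linalg.cholesky(J @ Q @ J)  # ordinary lower Cholesky of the reversed matrix
    return J @ L @ J                   # reversing rows and columns makes it upper triangular

# quick check on a random SPD matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q = B @ B.T + 5 * np.eye(5)
U = rchol(Q)
assert np.allclose(np.tril(U, -1), 0)  # upper triangular
assert np.allclose(U @ U.T, Q)         # factorization holds
```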
4. Computer experiments with functional inputs and scalar outputs
Thomas Muehlenstaedt, Jana Fruth & Olivier Roustant 2016 https://doi.org/10.1007/s11222-016-9672-z
4.1. Background and notation
4.1.1. Some basics on B-splines
B-splines are always bounded. Let \(f(t) = \sum_{i=1}^K \beta_i B_{i, m}(t)\), where \(B_{i, m}\), \(i = 1, \dots, K\) (with \(K \ge m\)), are B-spline basis functions of order \(m\) (\(m = 1\) gives piecewise constant functions) and \(\mathbf{\beta} = (\beta_1, \dots, \beta_K)\) is the vector of basis coefficients.
The basis functions are defined over a sequence of \(K - m + 2\) distinct knots with \(m - 1\) extra replicates of the first and last knot, \(\tau_1 = \dots = \tau_m < \tau_{m + 1} < \dots < \tau_{K} < \tau_{K+1} = \dots = \tau_{K+m}\). The basis functions are found recursively (see paper). At each point \(t\), \(\sum_{i=1}^K B_{i, m}(t) = 1\). If \(\mathbf{\beta} \in [0, 1]^K\), then \(f(t) \in [0, 1]\).
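A short scipy sketch of the clamped knot layout and the partition-of-unity property described above; the order, number of basis functions, and evaluation grid are arbitrary choices of mine.

```python
import numpy as np
from scipy.interpolate import BSpline

m, K = 4, 8                              # order (cubic) and number of basis functions
inner = np.linspace(0, 1, K - m + 2)     # K - m + 2 distinct knots on [0, 1]
t = np.r_[[0.0] * (m - 1), inner, [1.0] * (m - 1)]   # first/last knot replicated m times
basis = [BSpline(t, np.eye(K)[i], m - 1) for i in range(K)]   # B_{i,m}, i = 1..K

x = np.linspace(0, 1, 201)
B = np.column_stack([b(x) for b in basis])
assert np.allclose(B.sum(axis=1), 1.0)   # partition of unity: sum_i B_{i,m}(x) = 1
```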
4.1.2. Distance-based approach
The \(L_2\) norm for functions is \[ D(f, \tilde{f}) = \sqrt{ \int_0^1 \left(f(t) - \tilde{f}(t)\right)^2 \, dt } \] whose square simplifies to \(\int_0^1 \sum_{i,j} \delta_i \delta_j B_{i,m}(t) B_{j,m}(t) \,\mathrm{d}t = \mathbf{\delta}^\top \mathbf{J} \mathbf{\delta}\) where \(\delta_i = \beta_i - \tilde{\beta}_i\) and \(J_{ij} = \int_0^1 B_{i,m}(t) B_{j,m}(t)\,\mathrm{d}t\) (see paper eq. 6).
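A self-contained sketch of that distance: fill \(\mathbf{J}\) by numerical quadrature and evaluate the quadratic form in the coefficient differences. Illustrative only; the setup mirrors the B-spline sketch above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.interpolate import BSpline

m, K = 4, 8
inner = np.linspace(0, 1, K - m + 2)
t = np.r_[[0.0] * (m - 1), inner, [1.0] * (m - 1)]
basis = [BSpline(t, np.eye(K)[i], m - 1) for i in range(K)]

# J_ij = int_0^1 B_i(t) B_j(t) dt, filled by adaptive quadrature
J = np.array([[quad(lambda s: float(basis[i](s)) * float(basis[j](s)), 0, 1)[0]
               for j in range(K)] for i in range(K)])

rng = np.random.default_rng(1)
beta, beta_tilde = rng.random(K), rng.random(K)
delta = beta - beta_tilde
D = np.sqrt(delta @ J @ delta)           # L2 distance between the two spline functions
print(D)
```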
4.2. Surrogate models
4.2.1. Weighting
The integral needs to be computed numerically [4]. At this point, they make an assumption from paper eq. (13) to (14) that doesn't fit what I need.
5. Wavelets and their Applications
Michel Misiti, Yves Misiti, Georges Oppenheim, Jean-Michel Poggi 2010 ISBN: 978-0-470-61249-1
- a wavelet is a function oscillating as a wave but quickly damped, hence well localized simultaneously in time and frequency
- Would this be related to horizontal and vertical variability as in jointFPCA?
- Denoising or function estimation: calculate the wavelet transform of observations, modify the coefficients profiting from their local nature, and reverse the transformation
- Wavelets provide a generally very sparse representation, reducing the volume of information to be coded
5.1. A tour and mathematical framework
Let \(\psi(t): \mathcal{T} \to \mathbb{R}\) be a real wavelet satisfying the sufficient admissibility conditions \(\psi \in L^1 \cap L^2\), \(t\,\psi(t) \in L^1\), and \(\int_{\mathbb{R}} \psi(t)\,\mathrm{d}t = 0\). From the single function \(\psi(t)\) we build a family of forms via translation and dilation with scale \(a > 0\) and position \(b\), \(\psi_{a, b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t - b}{a}\right)\).
- Wavelets integrate to zero, \(\int_{\mathbb{R}}\psi(t)\,\mathrm{d}t = 0\); equivalently, their Fourier transform vanishes at frequency zero
- We often require wavelets to have \(m\) vanishing moments, i.e., \(\int_{\mathbb{R}} t^k \psi(t)\,\mathrm{d}t = 0\) for \(k = 0, \dots, m\). The higher \(m\) is, the more \(\psi(t)\) oscillates.
- Consider a wavelet that oscillates like a wave and is then localized due to damping
- Oscillation is measured by the number of vanishing moments \(m\)
- Localization is evaluated in the interval where the wavelet takes values significantly different from zero
Define \(C_f(a, b) = \int_{\mathbb{R}} f(t)\,\psi_{a, b}(t)\,\mathrm{d}t\), the continuous wavelet transform of a finite-energy (square integrable) function \(f\).
- We analyze \(f(t)\) with the wavelet \(\psi(t)\)
- The wavelet coefficients \(C_f(a, b)\) measure the fluctuations of function \(f\) at scale \(a\)
- It eliminates the trend at scale \(a\) containing slower evolutions to facilitate a local analysis of \(f\)
- If \(\psi(t) = 0 \,\forall\,t\not\in[-M, M]\), then \(\psi_{a, b}(t) = 0 \,\forall\,t\not\in[-Ma + b, Ma+b]\). That is, \(C_f(a, b)\) depends on \(f\) in a neighborhood of \(b\) with a length proportional to \(a\).
- Large values of \(C_f(a, b)\) provide information on the local irregularity of \(f\) around position \(b\) at scale \(a\).
Restrict \(a = 2^j\), \(b = k\,2^j = k\,a\,\forall\,(j,k)\in\mathbb{Z}^2\). Assume \(\psi(t)\) has \(m\) vanishing moments and is orthogonal to the polynomials of degree lower than or equal to \(m\). Define the scaling function \(\phi(t)\) and its corresponding dilations and translations \(\phi_{a, b}(t)\). The scaling function \(\phi_{a, b}(t)\) measures the local approximation of the function \(f\) in a neighborhood of \(b\) at scale \(a\) (needs checking).
- The wavelet \(\psi(t)\) oscillates and integrates to zero while the scaling function \(\phi(t)\) oscillates less and has positive integral (typically, equal to 1).
- The scalar product of \(f(t)\) with \(\psi_{a, b}(t)\) captures the fluctuations of \(f\) around a local average, given by the scalar product of \(f(t)\) with \(\phi_{a, b}(t)\).
Define the basic atoms of wavelets, \(\psi_{j, k}(t) = 2^{-j/2}\,\psi\left(2^{-j} t - k\right)\) and \(\phi_{j, k}(t) = 2^{-j/2}\,\phi\left(2^{-j} t - k\right)\).
The wavelet coefficients \(\alpha_{j,k} = \int_{\mathbb{R}} f(t)\,\psi_{j,k}(t)\,\mathrm{d}t\) can be computed via a fast algorithm involving four filters: low/high-pass decomposition and low/high-pass reconstruction. Under some conditions, the signal can be reconstructed with \(f(t) = \sum_{j\in\mathbb{Z}}\sum_{k\in\mathbb{Z}} \alpha_{j,k}\,\psi_{j,k}(t)\).
These are organized in a tree with:
- time \(k = b/a = 2^{-j}b\) used to translate the forms for a given level \(j\), and
- scale \(a = 2^j\) used to pass from level \(j\) to the immediately lower level
The local averages of \(f\) are defined by the approximation coefficients \(\beta_{j,k} = \int_{\mathbb{R}} f(t)\,\phi_{j,k}(t)\,\mathrm{d}t\) and the approximation signal \(A_j(t) = \sum_{k\in\mathbb{Z}} \beta_{j,k}\, \phi_{j,k}(t)\). Differences between two successive local averages are defined by the wavelet or detail coefficients \(\alpha_{j,k} = \int_{\mathbb{R}} f(t)\, \psi_{j,k}(t) \,\mathrm{d}t\) and the detail signal \(D_j(t) = \sum_{k\in\mathbb{Z}} \alpha_{j,k}\, \psi_{j,k}(t)\). The approximation and detail signals are functions of time \(t\), as the original signal \(f(t)\) is, but the coefficients of level \(j\) are dyadic in time \(2^{j}\mathbb{Z} = \{m\,2^j: m \in \mathbb{Z}\}\) (the subgroup of all integers that are multiples of \(2^j\), say \(\dots, -8, -4, 0, 4, 8, \dots\) for \(j=2\)). To smooth a signal, truncate the reconstruction to only \(J\) levels of wavelet coefficients, \(\hat{f}(t) = \sum_{1\le j\le J}\sum_{k}\alpha_{j,k}\psi_{j,k}(t)\). The larger (smaller) \(j\) is, the coarser (finer) the approximation and detail are.
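A minimal numpy illustration of one decomposition level with the Haar wavelet: approximation coefficients as local averages, detail coefficients as differences between neighbors, exact reconstruction, and smoothing by dropping the details. This is my own toy example, not the book's code.

```python
import numpy as np

def haar_step(f):
    """One decomposition level: local averages (approximation) and differences (detail)."""
    f = np.asarray(f, float)
    approx = (f[0::2] + f[1::2]) / np.sqrt(2)   # beta_{j,k}: local averages
    detail = (f[0::2] - f[1::2]) / np.sqrt(2)   # alpha_{j,k}: local fluctuations
    return approx, detail

def haar_inverse(approx, detail):
    """Exact reconstruction from approximation and detail coefficients."""
    f = np.empty(2 * len(approx))
    f[0::2] = (approx + detail) / np.sqrt(2)
    f[1::2] = (approx - detail) / np.sqrt(2)
    return f

f = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * np.random.default_rng(0).standard_normal(64)
a, d = haar_step(f)
assert np.allclose(haar_inverse(a, d), f)       # perfect reconstruction
smooth = haar_inverse(a, np.zeros_like(d))      # dropping the details smooths the signal
```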
The discrete transform is defined by restricting the scales and positions to \(a = a_0^{p}\) and \(b = n\,b_0\,a_0^{p}\) for some fixed \(a_0 > 1\), \(b_0 > 0\), and \(p,n\in\mathbb{Z}\). A usual choice is \(a_0 = 2\) and \(b_0=1\) due to Shannon's sampling theorem.
5.2. Wavelet families
Criteria for choosing a wavelet \(\psi(t)\):
- support and speed of convergence to zero as \(t\to\infty\), which quantifies localization in time
- symmetry, which avoids dephasing
- number of zero moments, which helps with compression
- regularity, which helps with smooth and regular signals
- existence of scaling function \(\phi(t)\)
- orthogonality
- existence of explicit formula
- ease of calculation
Orthogonal wavelets with compact support
- Daubechies wavelets of order \(N\)
- Haar wavelet for \(N = 1\)
- No explicit expression for \(N > 1\)
- Symlets
- coifN
Orthogonal wavelets with non-compact support
- Meyer wavelet
- infinitely differentiable
- not compactly supported, but converges fast to zero (rapid decay)
- derivatives also converge fast to zero
6. Log Gaussian Cox processes on the sphere
Francisco Cuevas-Pacheco, Jesper Møller 2018 https://doi.org/10.1016/j.spasta.2018.06.002
6.1. Introduction
- Although models and methods developed for planar and spatial point processes may be adapted, statistical methodology for point processes on the sphere is still not so developed
- Focus on Log Gaussian Cox processes for inhomogeneous aggregated/clustered point patterns on the sphere
6.2. Background
Let \(\mathbb{S}^{d} = \{x\in\mathbb{R}^{d+1}: \lvert x \rvert = 1\}\) be the \(d\)-dimensional unit sphere in the (\(d+1\))-dimensional Euclidean space equipped with inner product \(\langle x, y\rangle = \sum_{i=0}^{d}{x_i y_i}\) and length \(\lvert x\rvert = \sqrt{\langle x, x\rangle}\). For unit vectors \(u,v\in\mathbb{S}^{d}\), let \(d(u,v) = \arccos(\langle u, v\rangle) = \arccos\left(\sum_{i=0}^{d}{u_i v_i}\right)\) be the geodesic distance on the sphere.
For a Poisson process \(\mathbf{X}\) on \(\mathbb{S}^d\) with intensity function \(\lambda\), the number of points \(N(\mathbb{S}^d)\) is Poisson distributed with mean \(\int\lambda(u)\,\mathrm{d}u\) and, conditional on \(N(\mathbb{S}^d)\), the points in \(\mathbf{X}\) are iid with density proportional to \(\lambda\).
Let \(\mathbf{Z} = \{\mathbf{Z}(u): u\in\mathbb{S}^d\}\) be a non-negative random field satisfying some conditions. Then \(\mathbf{X} | \mathbf{Z}\) is a Poisson process on \(\mathbb{S}^d\) with intensity function \(\mathbf{Z}\). The intensity of \(\mathbf{X}\) is \(\lambda(u) = \mathbb{E}[\mathbf{Z}(u)]\), and the residual driving random field \(\mathbf{Z}_0(u) = \mathbf{Z}(u) / \lambda(u)\) gives the pair correlation function \(g(u,v) = \mathbb{E}[\mathbf{Z}_0(u)\mathbf{Z}_0(v)]\).
6.3. Log Gaussian Cox processes
Set \(\mathbf{Z} = \exp(\mathbf{Y})\) where \(\mathbf{Y}=\{Y(u): u\in\mathbb{S}^d\}\) is a Gaussian random field. Then, the LGCP \(\mathbf{X}\) has intensity function \(\lambda(u) = \exp(\mu(u) + C(u,u)/2)\) and pair correlation function \(g(u,v) = \exp(C(u,v))\), where \(\mu\) and \(C\) are the mean and covariance functions of the underlying Gaussian random field.
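A rough simulation sketch of an LGCP on \(\mathbb{S}^2\), assuming an exponential covariance in the geodesic distance and a cell-wise Poisson approximation on a longitude/colatitude grid; the grid resolution and parameter values are arbitrary choices of mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, rho = 3.0, 1.0, 0.2            # mean, variance, and range of the GRF Y

# longitude/colatitude grid on the unit sphere (cell centers)
n_lon, n_lat = 60, 30
lon = np.linspace(0, 2 * np.pi, n_lon, endpoint=False) + np.pi / n_lon
colat = np.linspace(0, np.pi, n_lat, endpoint=False) + np.pi / (2 * n_lat)
LON, COLAT = np.meshgrid(lon, colat)
U = np.column_stack([np.sin(COLAT).ravel() * np.cos(LON).ravel(),
                     np.sin(COLAT).ravel() * np.sin(LON).ravel(),
                     np.cos(COLAT).ravel()])

# exponential covariance in the geodesic distance d(u, v) = arccos(<u, v>)
d = np.arccos(np.clip(U @ U.T, -1, 1))
C = sigma2 * np.exp(-d / rho)

# one realization of Z = exp(Y) and cell-wise Poisson counts
Y = mu + np.linalg.cholesky(C + 1e-8 * np.eye(len(U))) @ rng.standard_normal(len(U))
cell_area = (2 * np.pi / n_lon) * (np.pi / n_lat) * np.sin(COLAT).ravel()  # ~ spherical area
counts = rng.poisson(np.exp(Y) * cell_area)

# the intensity of X is lambda(u) = E[Z(u)] = exp(mu + C(u, u) / 2)
lam = np.exp(mu + sigma2 / 2)
print("expected points:", lam * 4 * np.pi, "simulated:", counts.sum())
```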
6.4. Further remarks
For any integer \(n\ge2\), the \(n\)-th order pair correlation function is given by \(g(u_1, \dots, u_n) = \exp\left(\sum_{1\le{i}<{j}\le{n}} C(u_i,u_j)\right)\). Conditional on a point of \(\mathbf{X}\) at \(u\), the process on \(\mathbb{S}^d \setminus \{u\}\) is an LGCP with mean function \(\mu_u(v)=\mu(v) + C(u, v)\) and unchanged covariance function \(C\).
Footnotes:
1. The original paper has a linear mean model that was assumed to be zero for these notes. GLS for linear coefficients should be easy to plug in.
2. The paper starts with spatially or temporally ordered neighbors \(\left\{z_j: 1 \le j \le i - 1\right\}\).
3. No need to increase \(m\) by one unit, really.
4. The paper has the \(\omega(t)\) function inside the squared parentheses.