# Literature notes

One day I decided to put all my notes in one place, let's see how it turns out. These might well be incomplete, erroneous. and make not a favor to the authors. Don't blame anyone for the stuff written below.

\(\def\textsc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}}\)

## Table of Contents

- 1. Estimation and Model Identification for Continuous Spatial
- 2. Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes
- 3. Vecchia Approximations of Gaussian-Process Predictions
- 4. Computer experiments with functional inputs and scalar outputs
- 5. Wavelets and their Applications
- 6. Log Gaussian Cox processes on the sphere

## 1. Estimation and Model Identification for Continuous Spatial

A. V. Vecchia 1988 https://doi.org/10.1111/j.2517-6161.1988.tb01729.x

### 1.1. Introduction

- An empirical variogram is basically a method of moments estimate of the true variogram, so nicely put :)
- \(n\) observations, \(m\) neighbors s.t. \(m << n\)
- Approximate likelihood functions \(L_m \to L\) as \(m \to n\)
- Iterative estimation whereby estimates based on \(L_m\) are used as initial values for estimates based on \(L_{m + 1}\)

### 1.2. Estimation

#### 1.2.1. Approximate likelihood function

Let \(\mathbf{z} = \xi(\mathbf{x}) + \mathbf{\eta}\), \(\xi\) is a
spatial component and \(\mathbf{\eta}\) an iid error^{1}. The
exact likelihood is

In terms of conditional normal probability functions \(p(\cdot | \cdot)\),

\begin{align} L(z) &= \prod_i^n p(\mathbf{z}_i | \left\{z_j \forall j \ne i \right\}) \\ &\approx \prod_i^n p(\mathbf{z}_i | \mathbf{z}_{im}) \end{align}
where \(z_{i}\) contains all but the \(i\)-th observation and \(z_{im}\)
contains a \(m\) neighbors around the $i$-th
observation^{2}. Select observations to include in
\(\mathbf{z}_{im}\) to some computationally thrifty criterion. If
the criterion depends on the model form and parameter values
(e.g., the length-scale), it can cause instabilities in iterative
parameter estimation procedure. For example, select the \(m\)
observations which smallest Euclidean distance in the inputs. The
exact likelihood equation is invariant to permutations of the
observations but the approximate equation is not, especially for
small \(m\). Systematic ordering rules include lexicographical
ordering in the input values (e.g., coordinates), or subseting
from observations within a fixed distance (e.g., for lattice
data).

#### 1.2.2. Iterative maximum likelihood estimation

The \(m\)-order approximate maximum likelihood estimates
\(\hat{\mathbf{\psi}}_m\) are obtained by numerically maximizing the
approximate likelihood function \(L_m\) for fixed \(m << n\). Start
with \(m=1\) with some initial values \(\mathbf{\psi}_0\), estimate
iteratively^{3} \(\hat{\mathbf{\psi}}_{m + 1}\) by numerically
maximizing \(L_{m + 1}\) with starting values
\(\hat{\mathbf{\psi}}_{m}\). Increase \(m\) until convergence is
achieved based on some statistic and an arbitrarily small change,
e.g., \(\Lambda_m = -2 \log
L_m\left(\hat{\mathbf{\psi}}_{m}\right)\) where \(\Lambda_m \to -2
\log \hat{L}_n\) as \(m \to n\) approaches the maximized exact
likelihood function. The application section monitors Akaike's
information criterion \(A_m = \Lambda_m + 2 |\mathbf{\psi}|\).

## 2. Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes

Andrew O. Finley, Abhirup Datta, Bruce D. Cook, Douglas C. Morton, Hans E. Andersen & Sudipto Banerjee 2019 https://doi.org/10.1080/10618600.2018.1537924

### 2.1. Introduction

- exploit high-performance computing libraries to obviate expensive numerical linear algebra (e.g., expensive matrix multiplications and factorizations)
- outlining three alternate formulations that are significantly more efficient for practical implementation than Datta et al. (2016a)
- application to 5 million locations

### 2.2. Nearest Neighbor Gaussian Processes

Let \(y(\mathbf{s}_i)\) and \(\mathbf{x}(\mathbf{s}_i)\) denote the response and the predictors at location \(\mathbf{s}_i\) for \(i = 1, \dots, n\), \(y(\mathbf{s}_i) = {\mathbf{x}(\mathbf{s}_i)}^\top \mathbf{\beta} + w(\mathbf{s}_i) + \varepsilon(\mathbf{s}_i)\) with iid noise \(\varepsilon(\mathbf{s}_i)\), a spatial random effect \(w(\mathbf{s}_i) \sim \text{GP}(0, C(\cdot | \mathbf{\theta}))\), and posterior

\begin{align} p(\mathbf{\beta}, \mathbf{\theta}, \tau^2) \times \textsc{N}(\mathbf{w} | \mathbf{\theta}, \mathbf{C}) \times N(\mathbf{y} | \mathbf{X}\mathbf{\beta} + \mathbf{w}, \tau^2 \mathbf{I}) \end{align}for some prior \(p(\mathbf{\beta}, \mathbf{\theta}, \tau^2)\), Limitations: (i) \(\mathbf{C}\) requires \(O(n^2)\) dynamic memory, (ii) \(\textsc{N}(\mathbf{w} | \mathbf{\theta}, \mathbf{C})\) involves factorizations (e.g., Cholesky) that require \(O(n^3)\) flops, (iii) predicting at \(K\) new locations require \(O(Kn^2)\) flops.

Note that \(\textsc{N}(\mathbf{w} | \mathbf{\theta}, \mathbf{C})\) can be rewritten as \(\mathbf{w} = \mathbf{A}\mathbf{w} + \mathbf{\eta}\), where \(\mathbf{A}\) is \(n \times n\) strictly lower-triangular with elements aij = 0, whenever \(j \le i\) and \(\eta \sim \textsc{N}(0, \mathbf{D})\) and \(\mathbf{D}\) is diagonal with entries \(d_{11} = \text{var}(w_1)\) and \(d_{ii} = \text{var}(w_i | \{w_j : j < i\})\). Then, \(\mathbf{C} = {(\mathbf{I} − \mathbf{A})}^{-1} \mathbf{D}{(I − A)}^{-\top}\) with \(D_{1,1} = C_{1,1}\) and \(\mathbf{A}\textsc{[}1, \cdot\textsc{]} = \mathbf{0}\) (null top row).

Setting some elements in the lower triangular part of \(\mathbf{A}\) to be zero reduces inversion complexity from \(O(n^3)\) to \(O(n m^3)\), where \(m\) is the number of nonzero elements in each row of \(\mathbf{A}\). With \(\tilde{\mathbf{C}} = {(\mathbf{I} − \mathbf{A})}^{-1} \mathbf{D}{(I − A)}^{-\top}\), \(\tilde{\mathbf{C}}^{-1} = {(\mathbf{I} − \mathbf{A})}^{\top} \mathbf{D}^{-1}{(I − A)}\) becomes sparse reducing the computation of quadratic forms \(\mathbf{u}^{\top} \tilde{\mathbf{C}}^{-1} \mathbf{v}\) from \(O(n^2)\) to \(O(nm))\). Also, \(\text{det}(\tilde{\mathbf{C}}) = \prod_{i=1}^n d_{ii}\). Note that while \(\tilde{\mathbf{C}}\) need not be sparse, \(\textsc{N}(\mathbf{w} | \mathbf{\theta}, \mathbf{C})\) is evaluated in linear time \(O(n)\).

The MCMC implementation in Datta et al. (2016a), which requires updating the \(n\) latent spatial effects \(\mathbf{w}\) sequentially, often is slow convergence. Three alternative models exploit marginalizing out the vector of spatial random effects.

#### 2.2.1. Collapsed NNGP

Consider \(\textsc{N}(\mathbf{y} | \mathbf{X}\mathbf{\beta} + \mathbf{w}, \tau^2 \mathbf{I}) \times \textsc{N}(\mathbf{w} | \mathbf{0}, \tilde{\mathbf{C}})\), then \(\mathbf{y} \sim N(\mathbf{X}\mathbf{\beta}, \mathbf{\Lambda})\) where \(\mathbf{\Lambda} = \tilde{\mathbf{C}} + \tau^2 \mathbf{I}\). This model has \(p + 4 < n + p + 4, p = |\mathbf{\beta}\) parameters. The exact computational cost, which depends on the data design via the location of the nonzero entries in the Cholesky factor, can be more than linear. See Algorithm 1 in paper for a Gibbs sampler with Metropolis-Hastings update, Algorithm 2 for inference on the spatial effect and prediction at new locations.

#### 2.2.2. NNGP for the response

Apply nearest neighbor approximation directly to the marginal likelihood of \(\mathbf{y}\), i.e., \(y(\mathbf{s}) \sim \textsc{N}({\mathbf{x}(\mathbf{s})}^{\top}, \mathbf{\Sigma})\) using a sparse covariance matrix \(\tilde{\mathbf{\Sigma}} \approx \mathbf{\Sigma} = \mathbf{C} + \tau^2 \mathbf{I}\) . Cannot recover \(\mathbf{w}\), use when the interest is on prediction. Computationally parsimonious solution for fully Bayesian analysis of massive spatial datasets. See Algorithm 3 and 4 for MCMC and posterior predictive inference on the response.

#### 2.2.3. Conjugate NNGP for MCMC-free exact Bayesian inference

A hybrid cross-validation approach used when inference on covariance parameters is of little interest. See Algorithm 5.

### 2.3. Implementation

C++ with `openBLAS`

and `LAPACK`

for matrix algebra, `openMP`

for
parallelization (`omp for`

for precision matrix components, ```
omp
for
```

with `reduction`

clause to compute quadratic form in
Pseudocode 3). Use an efficient search algorithm for fast neighest
neighbors rather than bruteforcing.

- Experiment #1 (model run time)
- permutation matrix has large impact con efficiency
- response is dramatically faster than collapsed
- marginal and negligible improvement beyond 6 and 12 CPUs, overhead if more
- point of diminishing returns vary with \(n\)
- run time: 13, 13, 95 seconds per iteration with a \(n = 10^7\) (ten million!) for response, conjugate, and collapse

- Experiment #2 (prediction)
- all methods yield similar statistics

- TIU dataset
- 5 million data points
- \(m = 15\) and \(m = 25\) produce indistinguishable results
- similar, if not identical, prediction statistics across all methods
- run time: 318, 38, 2E-3 hours for collapse (25,000 samples), response, and conjugate NNGP

### 2.4. Summary

- collapsed is the only fully Bayesian alternative for inference on the spatial random effects
- conjugate model's uncompromised predictive inference requires the specification of some model parameters
- the
`spNNGP`

R package implements response and conjugate NNGP - for nonstationary processes, dynamic neighbor-finding algorithms may offer better starting point

## 3. Vecchia Approximations of Gaussian-Process Predictions

Matthias Katzfuss, Joseph Guinness, Wenlong Gong & Daniel Zilber 2020 https://doi.org/10.1007/s13253-020-00401-7

### 3.1. Introduction

- GP approximations
- sparse covariance matrix, sparse precision matrix, composite likelihood, low-rank structure.
- Low-rank approaches are poorly suited for fine-scale dependence
- Sparse methods cannot guarantee linear times
- Local GP approximations do not scale well to joint predictions at many locations

- Vecchia approximation
- Training: sparse Cholesky factor of the precision matrix by removing conditioning variables in a factorization of the observations joint density into a product of conditional distributions
- Prediction: Vecchia's one-at-a-time has squared time complexity, Finley one-for-one output sampling from the posterior predictive distributions using a covariance matrix constructed only with neighbors, conditional expectation and uncertainty quantification via conditional simulations, domain partitioning approximations

- This paper approach guarantees sparsity of the matrices necessary for inference, resulting in linear memory and time complexity in the number of data points and predictions for fixed conditioning-set size

### 3.2. General Vecchia Framework

- For a vector \(\mathbf{x}\) with precision matrix \(\mathbf{Q}\), \(\mathbf{U} = \textsc{rchol}\left(\mathbf{Q}\right)\) is the upper-lower Cholesky decomposition of \(\mathbf{Q}\) s.t. \(\mathbf{Q} = \mathbf{U}\mathbf{U}^{\top}\). Nonzero entries of \(\mathbf{U}\) can be computed directly, i.e., the matrix \(\mathbf{Q}\) need not be constructed.

## 4. Computer experiments with functional inputs and scalar outputs

Thomas Muehlenstaedt, Jana Fruth & Olivier Roustant 2016 https://doi.org/10.1007/s11222-016-9672-z

### 4.1. Background and notation

#### 4.1.1. Some basics on B-spiles

B-splines are always bounded. Let \(f(t) = \sum_{i=1}^K \beta_i B_{i, m}(t)\), where \(B_{i, m}, i = 1, \dots, K \ge m\) are B-spline basis functions of order \(m\) (\(m = 1\) is piecewise constant function) and \(\mathbf{\beta} = (\beta_1, \dots, \beta_K)\) is the vector of basis coefficients.

The basis functions are defined over a sequence of \(K - m + 2\) knots with \(m - 1\) replicates for the first and last knot. \(\tau_1 = \dots = \tau_m < \tau_{m + 1} < \dots \tau_{K} \dots < \tau_{K+m}\). The basis functions are found recursively (see paper). At each point \(t\), \(\sum_{i=1}^K B_{i, m}(t) = 1\). If \(\mathbf{\beta} \in [0, 1]^K\), then \(f(t) \in [0, 1]\).

#### 4.1.2. Distance-based approach

The \(L_2\) norm for functions is \[ D(f, \tilde{f}) = \sqrt{ \int_0^1 \left(f(t) - \tilde{f}(t)\right)^2 \ dt } \] which simplifies to \(\int_0^1 \sum_{i,j} \delta_i \delta_j b_i(t) b_j(t) \mathrm{d}t = \mathbf{\delta}^\top \mathbf{J} \mathbf{\delta}\) where \(\delta_i = \beta_i - \tilde{\beta}_j\) (see paper eq. 6).

### 4.2. Surrogate models

#### 4.2.1. Weighting

The integral needs to be computed numerically^{4}. At this point, they
make an assumption from paper eq. (13) to (14) that doesn't fit
what I need.

## 5. Wavelets and their Applications

Michel Misiti, Yves Misiti, Georges Oppenheim, Jean-Michel Poggi 2010 ISBN: 978-0-470-61249-1

- a wavelet is a function oscillating as a wave but quickly damped,
hence well localized simultaneously in time and frequency
- Would this be related with horizontal and vertical variability as in jointFPCA?

- Denoising or function estimation: calculate the wavelet transform of observations, modify the coefficients profiting from their local nature, and reverse the transformation
- Wavelets provide a generally very sparse representation, reducing the volume of information to be coded

### 5.1. A tour and mathematical framework

Let \(\psi(t): \mathcal{T} \to \mathbb{R}\) be a real wavelet satisfying the sufficient admissibility conditions \(\psi \in L^1 \cap L^2\), \(t\,\psi(t) \in L^1\), and \(\int_{\mathbb{R}} \psi(t)\,\mathrm{d}t = 0\). From a single function \(\psi(t)\) we build a family of forms via translation and dilation with scale \(a\) and position \(b\)

- Wavelets integrate to zero in frequency domain, i.e., \(\int_{\mathbb{R}}\psi(t)\,\mathbb{d}t = 0\)
- We often require wavelets to have \(m\) vanishing moments, i.e., \(\int_{\mathbb{R}} t^k \psi(t)\,\mathrm{d}t = 0\) for \(k = 0, \dots, m\). The higher \(m\) is, the more \(\psi(t)\) oscillates.
- Consider a wavelet that oscillates like a wave and is then
locatalized due to damping
- Oscillation is measured by the number of vanishing moments \(m\)
- Localization is evaluated in the interval where the wavelet takes values significantly different from zero

Define \(C_f(a, b)\) the continuous wavelet transform of a finite-energy (square integrable) function \(f\),

- We analyze \(f(t)\) with the wavelet \(\psi(t)\)
- The wavelet coefficients \(C_f(a, b)\) measure the fluctuations of function \(f\) at scale \(a\)
- It eliminates the trend at scale \(a\) containing slower evolutions to facilitate a local analysis of \(f\)
- If \(\psi(t) = 0 \,\forall\,t\not\in[-M, M]\), then \(\psi_{a, b}(t) = 0 \,\forall\,t\not\in[-Ma + b, Ma+b]\). That is, \(C_f(a, b)\) depends on \(f\) in a neighborhood \(b\) with a length proportional to \(a\).
- Large values of \(C_f(a, b)\) provide information on the local irregularity of \(f\) around position \(b\) at scale \(a\).

Restrict \(a = 2^j\), \(b = k\,2^j = k\,a\,\forall\,(j,k)\in\mathbb{Z}^2\). Assume \(\psi(t)\) has \(m\) vanishing moments and is orthogonal to the polynomials of degree lower of equal to \(m\). Define the scaling function \(\phi(t)\) and their corresponding dilations and translations \(\phi_{a, b}(t)\). The scaling function \(\phi_{a, b}(t)\) measures the local approximation of function \(f\) in a neighborhood \(b\) at scale \(a\) needs checking.

- The wavelet \(\psi(t)\) oscillates and integrates to zero while the scaling function \(\phi(t)\) oscillates less and has positive integral (typically, equal to 1).
- The scalar product \(f(t)\,\psi(t)\) captures the fluctuations of \(f\) around a local average given by the scalar product \(f(t)\,\phi(t)\).

Define the *basic atoms of wavelets*

The wavelet coefficients \(\alpha_{j,k} = \int_{\mathbb{R}} f(t)\,\psi_{j,k}(t)\,\mathrm{d}t\) and can be computed via a fast algorithm involving four filters: low/high-pass decomposition, low/high-pass reconstruction. Under some conditions, the signal can be reconstructed with

These are organized in a tree with:

- time \(k = b/a = 2^{-j}a\) used to translate the forms for a given level \(j\), and
- scale \(a = 2^j\) used to pass from level \(j\) to the immediately lower level

The local averages of \(f\) are defined by the approximation coefficient \(\beta_{j,k} = \int_{\mathbb{R}} f(t)\phi_{j,k}(t)\,\mathrm{d}t\) and the approximation signal \(A_j(t) = \sum_{k\in\mathbb{Z}} \beta_{j,k} \psi_{j,k}(t) \,\mathrm{d}t\). Differences in two successive local averages are defined by the wavelet or detail coefficients \(\alpha_{j,k} = \int_{\mathbb{R}} f(t) \psi_{j,k}(t) \,\mathrm{d}t\) and the detail signal \(D_j(t) = \sum_{k\in\mathbb{Z}} \alpha_{j,k} \psi_{j,k}(t)\). The approximation and detail signals are a function of time \(t\) as the original signal \(f(t)\) is, but the coefficients of the level \(j\) are dyadic in time \(2^{j}\mathbb{Z} = \{m\,2^j: m \in \mathbb{Z}\}\) (the subgroup of all integers that are multiples of \(2^j\), say \(\dots\), -16, -8, 0, 8, 16, \(\dots\) for \(j=2\)). To smooth a signal, truncate the reconstruction with only \(J\) wavelet coefficients \(\hat{f}(t) = \sum_{1\le j\le J}\sum_{k}\alpha_{j,k}\psi_{j,k}(t)\). The larger (smaller) \(j\) is, the coarser (finer) the approximation and detail are.

The discrete transform is defined

for some fixed \(a_0 > 1\), \(b_0 > 0\), and \(p,n\in\mathbb{Z}\). An usual choice is \(a_0 = 2\) and \(b_0=1\) due to Shannon's sampling theorem.

### 5.2. Wavelet families

Criterias for choosing a wavelet \(\psi(t)\):

- support and speed of convergence to zero as \(t\to\infty\), which quantifies localization in time
- symmetry, which avoids dephasing
- number of zero moments, which helps with compression
- regularity, which helps with smooth and regular signals
- existence of scaling function \(\phi(t)\)
- orthogonality
- existence of explicit formula
- ease of calculation

Orthogonal wavelets with compact support

- Daubechies wavelets of order \(N\)
- Haat wavelet for \(N = 1\)
- No explicit expression for \(N > 1\)

- Symlets
- coifN

Orthogonal wavelets with non-compact support

- Meyer wavelet
- infinitely derivable
- not compactly supported, but converges fast to zero
- infinitely derivable
- derivatives converge fast to zero
- rapid decay

## 6. Log Gaussian Cox processes on the sphere

Francisco Cuevas-Pacheco, Jesper Møller 2018 https://doi.org/10.1016/j.spasta.2018.06.002

### 6.1. Introduction

- Although models and methods developed for planar and spatial point processes may be adapted, statistical methodology for point processes on the sphere is still not so developed
- Focus on Log Gaussian Crox processes for inhomogeneous aggregated/clustered point patterns on the sphere

### 6.2. Background

Let \(\mathbb{S}^{d}: \{x\in\mathbb{R}^{d+1}: \lvert x \rvert = 1\}\) be a $d$-dimensional unit sphere included in the (\(d+1\))-dimensional Euclidean space equipped with inner product \(\langle x, y\rangle = \sum_{i=0}^{d}{x_i y_i}\) and length \(\lvert x\rvert = \sqrt{\langle x, x\rangle}\). For unit vectors \(u,v\in\mathbb{S}^{d}\), let \(d(u,v) = \arccos(\langle u, v\rangle) = \arccos(\sum_{i=0}^{d}{u_i,v_i})\) be the geodesic distance on the sphere.

For a Poisson process \(\mathbf{X}\in\mathbb{S}^d\) with intensity function \(\lambda\), \(N(\mathbb{S}^d)\) is Poisson distributed with mean \(\int\lambda(u)\,\mathrm{d}u\) and, conditional on \(N(\mathbb{S}^d)\), the points in \(\mathbf{X}\) are iid with density proportal to \(\lambda\).

Let \(\mathbf{Z} = \{\mathbf{Z}(u): u\in\mathbb{S}^d\}\) be a non-negative random field satisfying some conditions. Then \(\mathbf{X} | \mathbf{Z}\) is a Poisson process on \(\mathbb{S}^d\) with intensity function \(\mathbf{Z}\). The intensity of \(\mathbf{X}\) is \(\lambda(u) = \mathbb{E}[\mathbf{Z}(u)]\), and the residual driving random field \(\mathbf{Z}_0(u) = \mathbf{Z} / \lambda(u)\) has pair correlation function \(g(u,v) = \mathbb{E}[\mathbf{Z}_0(u)\mathbf{Z}_0(v)]\).

### 6.3. Log Gaussian Cox processes

Set \(\mathbb{Z} = \exp(\mathbf{Y})\) where \(Y=\{Y(u): u\in\mathbb{S}^d\}\) is a Gaussian random field. Then, a LGCP \(\mathbf{X}\) has intensity function \(\lambda(u) = \exp(\mu(u) + C(u,u)/2)\) and pair correlation function \(g(u,v) = \exp(C(u,v))\) where \(\mu\) and \(C\) are the mean and covariance functions of the underlaying Gaussian random field.

### 6.4. Further remarks

For any integer \(n\ge2\), the $n$-th order pair correaltion function is given by \(g(u_1, \dots, u_n) = \exp(\sum_{1\le{i}\le{j}\le{n}} C(u_i,u_j))\). The conditional distribution \(\mathbf{X}(v) | \mathbf{X}(u)\) for \(v\in\{\mathbb{S}^d \backslash {u}\}\) has mean function \(m_u(v)=m(v) + C(u, v)\) and covariance function \(C(u, v)\) (covariance function remains unchanged).

## Footnotes:

^{1}

The original paper has a linear mean model that was assumed to be zero for these notes. GLS for linear coefficients should be easy to plug in.

^{2}

The paper starts with spatially or temporally ordered neighbors \(\left\{z_j: 1 \le j \le i - 1\right\}\).

^{3}

No need to increase \(m\) by an unit really.

^{4}

The paper has the \(\omega(t)\) function inside the squared parentheses.