This notebook is part of the material presented in R/Finance 2017. Please, see the README file.

\[ \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\EE}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\VV}[1]{\mathbb{V}\left[ #1 \right]} \newcommand{\RR}{\mathbb{R}} \newcommand{\II}{\mathbb{I}} \newcommand{\NN}{\mathbb{N}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bT}{\mathbf{T}} \newcommand{\btheta}{\mathbf{\theta}} \]

What’s all this about?

The aim of this notebook is twofold. First, I’d like to draw your attention to a small fact observed in financial asset prices when filtered through a Markov Switching GARCH model: when log returns are filtered through a GARCH model with Markovian dynamics, the belief states (low/high volatility) are correlated across assets. It’s not a game changer, but hopefully it will trigger some new ideas to improve your trading strategy or risk model. Second, I’ll use this small fact as an excuse to introduce you to a few concepts that will certainly be of help to any financial analyst.

You’ll be exposed to the following concepts:

  1. Mixture Normal and Markov Switching GARCH model.
  2. Hidden Markov Model and the forward algorithm.
  3. Bayesian statistics and Stan, a probabilistic programming language for statistical inference.
  4. Neat plots.

Worried by the broad scope? I’ll only explain the basics and leave many, many references. Hopefully, this will be enough exposure for you to decide if the topic is worth investing more effort.

Mixture of Normals Conditional Heteroskedasticity

Conditional Heteroskedasticity

Also known as volatility clustering, conditional heteroskedasticity is a key feature of economic time series, in particular of returns of financial assets. Relatively volatile periods, characterized by large price changes and large returns, alternate with more calm periods with stable prices and small returns (Franses, Dijk, and Opschoor 2014, 26).

Let \(P_t\) be the price of an asset at time \(t\) and \(r_t = \log(P_t) - \log(P_{t-1})\) be the log return. The expected return \(m_t\) is estimated by any model for the mean that leaves an unexplained component \(\epsilon_t\) called stochastic error or random shock:

\[ r_t = m_t + \epsilon_t. \] As will be seen later, the shock is related to the variance of the returns and may be modelled deterministically or stochastically. In the first case, the error is assumed to have the form \(\epsilon_t = v_t \sqrt{h_t}\), where \(v_t\) is a standardized random variable, i.e. independent and identically distributed with zero mean and unit variance. Note we’ve not assumed a Gaussian shape. Franses, Dijk, and Opschoor (2014, 170) show that the shocks have a constant unconditional variance equal to \(\sigma^2\) and a distribution conditional on past information \(\II_{t-1}\) with mean zero and variance \(h_t\). That is,

\[ \EE{\epsilon_t} = 0, \quad \VV{\epsilon_t} = \sigma^2, \quad \EE{\epsilon_t | \II_{t-1}} = 0, \quad \VV{\epsilon_t | \II_{t-1}} = h_t. \] Building upon Autoregressive Conditional Heteroskedasticity (ARCH), Bollerslev (1986) proposed the Generalized Autoregressive Conditional Heteroskedasticity (GARCH). The conditional variance is a deterministic function of its past values and past squared shocks:

\[ (1) \quad \quad h_t = \alpha_0 + \sum_{i=1}^{q}{\alpha_i \ \epsilon^2_{t-i}} + \sum_{j=1}^{p}{\beta_j \ h_{t-j}},\] with \(\alpha_0 > 0\), \(\alpha_i \ge 0\), \(\beta_j \ge 0\) and \(\sum_{i=1}^{\max(p, q)}{(\alpha_i + \beta_i)} < 1\) (taking \(\alpha_i = 0 \ \forall \ i > q\) and \(\beta_j = 0 \ \forall \ j > p\)).

What does all this mean? Simply, that the volatility of returns - measured as the conditional second moment of the series - varies deterministically with time. Changes in conditional volatility are led by past volatility and shocks. The larger the last \(q\) shocks or the observed volatility in the last \(p\) periods, the more uncertain we are about the next return. It’s more uncertain to place a bet on today’s close price if the stock just moved far from its expected value. A fairly simple model indeed, but it makes sense.
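To make equation (1) concrete, here is a minimal base-R sketch that simulates a \(GARCH(1, 1)\) path. The parameter values are purely illustrative, not estimates from any data.

```r
# Simulate a GARCH(1, 1) process, equation (1) with p = q = 1.
# alpha1 + beta1 < 1 ensures a finite unconditional variance.
set.seed(42)
TT     <- 1000
alpha0 <- 0.1; alpha1 <- 0.1; beta1 <- 0.8   # illustrative values

h   <- numeric(TT)                        # conditional variances h_t
eps <- numeric(TT)                        # shocks epsilon_t
h[1]   <- alpha0 / (1 - alpha1 - beta1)   # start at the unconditional variance
eps[1] <- sqrt(h[1]) * rnorm(1)           # v_t is standard Gaussian here
for (t in 2:TT) {
  h[t]   <- alpha0 + alpha1 * eps[t - 1]^2 + beta1 * h[t - 1]
  eps[t] <- sqrt(h[t]) * rnorm(1)
}

# Volatility clustering: squared shocks are autocorrelated even though
# the shocks themselves are serially uncorrelated.
acf(eps^2, lag.max = 5, plot = FALSE)
```

With these values the unconditional variance is \(\alpha_0 / (1 - \alpha_1 - \beta_1) = 1\), which is where the recursion is initialized.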

By the way, the GARCH family is HUGE. There’s a vast literature originated in the academy during the 90’s and fueled by its extensive use in the industry in the 00’s. Many variations allow for more complex patterns like non-linearity, asymmetry and fatter tails: Integrated GARCH, GARCH in mean, Exponential GARCH, Threshold GARCH, GJR GARCH, Asymmetric Power GARCH, student t and Generalized Error Distribution, among many others. See Franses, Dijk, and Opschoor (2014) pp. 176 and Ghalanos (2015), Vignettes.

Mixture of Normals

A random variable \(\epsilon\) is said to have a univariate finite normal mixture distribution if the unconditional density is given by

\[ MN(\epsilon) = \sum_{j=1}^{K}{\lambda_j \ N(\epsilon | \mu_j, \ \sigma^2_j)}, \] where \(K \in \NN\) is the number of components, \(N(\mu, \ \sigma^2)\) is the Gaussian density with mean \(\mu\) and variance \(\sigma^2\), and \(\lambda_j\) are the mixing weights with \(\lambda_j > 0 \ \forall \ j\) and \(\sum_{j=1}^{K}{\lambda_j} = 1\) (Haas, Mittnik, and Paolella 2004).

A mixture of distributions is an old statistical trick to accommodate skewness and excess kurtosis. It has an extra feature in economics: the components are usually subject to simple yet useful interpretations. The variable of interest has a different mean and variance depending on which state the process is in. In our model, states may represent recession/expansion, bearish/bullish, low/high volatility, risk off/on, and so on. Of course, the choice isn’t necessarily binary. Keep reading to learn more!
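A quick sanity check of that claim: the sketch below draws from a two-component mixture of normals and computes the sample kurtosis. The weights and standard deviations are illustrative only.

```r
# Draw from a two-component mixture of normals and verify the excess
# kurtosis it induces relative to a single Gaussian.
set.seed(42)
n      <- 1e5
lambda <- c(0.9, 0.1)    # mixing weights, sum to 1
sigma  <- c(1, 3)        # calm and volatile components

z   <- sample(1:2, n, replace = TRUE, prob = lambda)  # component membership
eps <- rnorm(n, mean = 0, sd = sigma[z])

# Sample kurtosis; a single Gaussian would give roughly 3
kurt <- mean((eps - mean(eps))^4) / var(eps)^2
```

For these values the theoretical kurtosis is well above 3, so the mixture produces the fat tails observed in asset returns even though each component is Gaussian.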

Mixture of Normals + Conditional Heteroskedasticity

If the time series \(\{\epsilon_t\}\) is generated by a \(K\)-component Mixture of Normals \(GARCH(p, q)\) process, then the conditional density of the shock is given by \[ f(\epsilon_t | \II_{t-1}) = MN(\lambda_1, \dots, \lambda_K, h_{1t}, \dots, h_{Kt}),\] where \(\II_{t-1}\) is the information set at time \(t-1\), \(\lambda_i \in (0, 1) \ i = 1, \dots, K\) and \(\sum_{i=1}^{K}{\lambda_i} = 1\). In the simplest case (diagonal symmetric), each possible state \(k = 1, \dots, K\) has its own conditional volatility governed by its own GARCH deterministic dynamics and its own state-dependent parameters:

\[ (2) \quad \quad h_{kt} = \alpha_{k0} + \sum_{i=1}^{q}{\alpha_{ki} \ \epsilon^2_{t-i}} + \sum_{j=1}^{p}{\beta_{kj} \ h_{k, \ t-j}}. \]

This is where it gets interesting. We only observe one series of log returns, which follows only one path of conditional volatility. Nonetheless, volatility at any given time \(t\) comes from one of \(K\) different states or regimes, each one with its own level and dynamics. Regime switching allows for non-linearity in the model and lets our estimates quickly adjust to changes in the market.
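A minimal simulation of equation (2) may help fix ideas: two state-specific \(GARCH(1, 1)\) variance paths are driven by the same shock series, and each period the shock is emitted by a component drawn with constant weights. All parameter values are illustrative.

```r
# K = 2 Mixture of Normals GARCH(1, 1): both variance recursions are updated
# with the same observed shock, but the shock is emitted by one component.
set.seed(42)
TT     <- 500
lambda <- c(0.7, 0.3)                    # constant mixing weights
alpha0 <- c(0.05, 0.30)                  # state-specific intercepts
alpha1 <- c(0.05, 0.15)
beta1  <- c(0.90, 0.80)

h   <- matrix(NA_real_, nrow = 2, ncol = TT)  # one variance path per state
eps <- numeric(TT)
h[, 1] <- alpha0 / (1 - alpha1 - beta1)       # state unconditional variances
z      <- sample(1:2, 1, prob = lambda)
eps[1] <- sqrt(h[z, 1]) * rnorm(1)
for (t in 2:TT) {
  h[, t] <- alpha0 + alpha1 * eps[t - 1]^2 + beta1 * h[, t - 1]
  z      <- sample(1:2, 1, prob = lambda)     # component drawn i.i.d. each period
  eps[t] <- sqrt(h[z, t]) * rnorm(1)
}
```

Note that both variance paths evolve at every \(t\), even when the shock was emitted by the other component; only the emission switches.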

Hidden Markov Model + Conditional Heteroskedasticity

A Hidden Markov Model (HMM) involves two interconnected models. The state model consists of a discrete-time, discrete-state Markov chain with hidden states \(z_t \in \{1, \dots, K\}\) that transition according to \(p(z_t | z_{t-1})\). The observation model is governed by \(p(\mat{y}_t | z_t)\), where \(\mat{y}_t\) are the observations, emissions or output. In our case, the latter is represented by the GARCH equation.

The HMM is a specific instance of the state space model family in which the latent variables are discrete. Each single time slice corresponds to a mixture distribution with component densities given by \(p(\mat{y}_t | z_t)\), so an HMM may be interpreted as an extension of a mixture model in which the component for each observation is not selected independently but depends on the component chosen for the previous observation. In the special case of a simple mixture model for an independent and identically distributed sample, all rows of the transition matrix are identical, so that the conditional distribution \(p(z_t | z_{t-1})\) does not depend on \(z_{t-1}\).

In other words, an HMM is equivalent to a Gaussian mixture model with cluster membership ruled by Markovian dynamics, also known as a Markov Switching Model (MSM). Multiple sequential observations tend to share the same location until they suddenly jump into a new cluster.

Bayesian estimation with Stan

Specification

The Mixture of Normals GARCH proposed above involves only \(K\) weights \(\lambda_1, \dots, \lambda_K\) that are constant over time. We now further assume that the \(K\) discrete regimes follow a first-order Markov process led by transition probabilities \(\bp\). Since we don’t observe the states directly, we say they are hidden, and we estimate a probability distribution over the possible states instead. Remember that there is one set of parameters for the GARCH equation in each state.

Let \(z_{t}\) be the discrete component, state or regime at time point \(t\). The state dynamics are governed by the transition matrix \(\mat{\Psi}\) of size \(K \times K\) and elements \(\Psi(i, j) = p(z_{t} = j \ | \ z_{t-1} = i)\) with \(0 \le \Psi(i, j) \le 1\) and \(\sum_{j}{\Psi(i, j)} = 1\). The matrix has \(K(K-1)\) independent parameters. The initial latent node \(z_{1}\) doesn’t have a parent node and has a marginal distribution \(p(z_1)\).

The specification of the probabilistic model is completed by the conditional distributions of the observed variables \(p(\epsilon_t | z_t, \btheta)\), where \(\btheta\) is a set of parameters for this distribution. Usually known as the emission probabilities in machine learning jargon, each element of this \(K\)-sized vector represents the probability that a given observation was emitted by each of the \(K\) possible states.

Forward algorithm

The forward algorithm is used to estimate the belief state \(p(z_{t} | \ \II_{t})\), the probability of a state at a certain time given the information available up to that moment. Estimating this \(K\)-sized vector is known as filtering. When combined with the backward algorithm, you get the smoothed probability \(p(z_{t} \ | \ \II_{T})\).

A filter infers the belief state based on all the available information up to that point, \(p(z_t | \mat{y}_{1:t})\). It achieves better noise reduction than estimating the hidden state from the current observation alone. The filtering process can be run online, or recursively, as new data streams in.

Filtered marginals can be computed recursively by means of the forward algorithm (Baum and Eagon 1967). Let \(\psi_t(j) = p(\mat{y}_t | z_t = j)\) be the local evidence at time \(t\) and \(\Psi(i, j) = p(z_t = j | z_{t-1} = i)\) be the transition probability. First, the one-step-ahead predictive density is computed

\[ p(z_t = j | \mat{y}_{1:t-1}) = \sum_{i}{\Psi(i, j) p(z_{t-1} = i | \mat{y}_{1:t-1})}. \]

Acting as prior information, this quantity is updated with the data observed at point \(t\) using Bayes rule, \[\begin{align*} \label{eq:filtered-belief_state} \alpha_t(j) & \triangleq p(z_t = j | \mat{y}_{1:t}) \\ &= p(z_t = j | \mat{y}_{t}, \mat{y}_{1:t-1}) \\ &= Z_t^{-1} \psi_t(j) p(z_t = j | \mat{y}_{1:t-1}) \\ &= Z_t^{-1} \psi_t(j) \sum_{i}{\Psi(i, j) \ \alpha_{t-1}(i)}, \end{align*}\]

where the normalization constant is given by

\[ Z_t \triangleq p(\mat{y}_t | \mat{y}_{1:t-1}) = \sum_{l}{p(\mat{y}_{t} | z_t = l) \ p(z_t = l | \mat{y}_{1:t-1})} = \sum_{l}{\psi_t(l) \ p(z_t = l | \mat{y}_{1:t-1})}. \]

This predict-update cycle results in the filtered belief states at point \(t\). As this algorithm only requires the evaluation of the quantities \(\psi_t(j)\) for each value of \(z_t\) for every \(t\) and fixed \(\mat{y}_t\), the posterior distribution of the latent states is independent of the form of the observation density or indeed of whether the observed variables are continuous or discrete Jordan (2003).

Let \(\mat{\alpha}_t\) be a \(K\)-sized vector with the filtered belief states at point \(t\), \(\mat{\psi}_t\) be the \(K\)-sized vector of local evidence at point \(t\), \(\mat{\Psi}\) be the transition matrix, and \(\mat{u} \odot \mat{v}\) be the Hadamard product, representing elementwise vector multiplication. Then, the Bayesian updating procedure can be expressed in matrix notation as

\[ \mat{\alpha}_t \propto \mat{\psi}_t \odot (\mat{\Psi}^T \mat{\alpha}_{t-1}). \]

In addition to computing the hidden states, the algorithm yields the log probability of the evidence

\[ \log p(\mat{y}_{1:T} | \mat{\theta}) = \sum_{t=1}^{T}{\log p(\mat{y}_{t} | \mat{y}_{1:t-1})} = \sum_{t=1}^{T}{\log Z_t}. \]

The computation should be carried out in the log domain to avoid numerical underflow.
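The predict-update recursion above, carried out in the log domain as suggested, can be sketched in a few lines of base R. The function and argument names are my own; `loglik` holds the log local evidence \(\log \psi_t(j)\).

```r
# Numerically stable log(sum(exp(x))) via the log-sum-exp trick
logsumexp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }

# loglik: T x K matrix with log psi_t(j) = log p(y_t | z_t = j)
# logPsi: K x K log transition matrix; logpi: log p(z_1)
forward <- function(loglik, logPsi, logpi) {
  TT <- nrow(loglik); K <- ncol(loglik)
  logalpha <- matrix(NA_real_, TT, K)       # log p(z_t = j, y_{1:t})
  logalpha[1, ] <- logpi + loglik[1, ]
  for (t in 2:TT)
    for (j in 1:K)                          # log of psi_t(j) * sum_i Psi(i,j) alpha_{t-1}(i)
      logalpha[t, j] <- loglik[t, j] +
        logsumexp(logalpha[t - 1, ] + logPsi[, j])
  # Normalizing each row gives the filtered beliefs p(z_t = j | y_{1:t})
  filtered <- exp(logalpha - apply(logalpha, 1, logsumexp))
  list(filtered    = filtered,
       logevidence = logsumexp(logalpha[TT, ]))  # log p(y_{1:T})
}
```

For a Gaussian emission model, the local evidence would be, e.g., `loglik[t, j] <- dnorm(y[t], 0, sigma[j], log = TRUE)`; each row of `filtered` sums to one by construction.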

Case Study

Let’s set math and computations aside for a while and focus on finance: what can we make out of a model like this?

Preamble

Before we get moving, make sure you’ve downloaded all the auxiliary files to the same folder. We first check that all the packages required for this notebook are available.

packages.req <- c('quantmod', 'rugarch', 'rstan', 'shinystan', 'lattice', 'gridExtra', 'gridBase', 'HH')
packages.mis <- packages.req[!(packages.req %in% rownames(installed.packages()))]

if (length(packages.mis)) {
  stop(paste('Please, first install the following packages:',
             paste(packages.mis, collapse = ', ')))
  # We could write a script to install them directly, but this may mess up other code of yours
}

Fetching the Data

We choose \(N = 5\) series from 2011-01-01 to 2016-12-31 (\(T = 1502\)): ^GSPC, F, GM, THO and AIR. The data is downloaded and pre-processed using quantmod (Ryan, Ulrich, and Thielen 2016).

library(quantmod)
syms <- c('^GSPC', 'F', 'GM', 'THO', 'AIR')

usecache = TRUE
cache.filename <- 'data/symsret.RDS'
if (usecache && file.exists(cache.filename)) {
  y_t <- readRDS(cache.filename)
} else {
  y_t <- as.matrix(
    do.call(cbind, lapply(syms, function(s) {
      p <- getSymbols(s, src = "yahoo", from = "2011-01-01", to = "2016-12-31", auto.assign = FALSE)
      r <- periodReturn(p, period = 'daily', type = 'log')
      colnames(r) <- s
      r * 100
    })))
  saveRDS(y_t, cache.filename)  # write the cache for the next run
}

Inference

The (very) naive implementation of the algorithm in Stan is only meant for illustration. A few good practices were neglected: convergence is not guaranteed, there is much room left for optimization, and fitting \(N\) independent models is probably not a reasonable choice for a production sampler. The main takeaway of this presentation is the ideas behind the code, not the code itself.

We’ll load Stan (2016) and run the model for \(K = 2\) states. Since the sampler may take a few minutes, I’ve built a very basic caching feature.

library(rstan)
rstan_options(auto_write = TRUE)  # Writes down the compiled sampler
options(mc.cores = parallel::detectCores()) # Use all available cores

K = 2 # Number of states
bmodel = 'stan/hmmgarch.stan' # Name of the model file
standata = list(
  N = ncol(y_t),
  T = nrow(y_t),
  K = K,
  y = t(y_t)
)

usecache = TRUE
cache.filename <- 'stan/hmmgarch-cache.rds'
if (usecache && file.exists(cache.filename)) {
  stan.fit <- readRDS(cache.filename)
} else {
  stan.fit <- stan(file = bmodel,
                   model_name = "Bayesian Hierarchical Mixture GARCH",
                   data = standata, verbose = T,
                   iter = 200, warmup = 100, thin = 1, chains = 4, cores = 4,
                   control = list(adapt_delta = 0.80))
  
  saveRDS(stan.fit, file = cache.filename)
}

We’ll also fit a single-regime GARCH model with rugarch (Ghalanos 2015). Sidenote: I’ve used rugarch considerably and it has proved to be very reliable; it’s a fine work of software engineering.

library(rugarch)
SR <- lapply(1:ncol(y_t), function(n) {
  mle.specSR <- ugarchspec(
    variance.model = list(model = "sGARCH", garchOrder = c(1, 1),
                          submodel = NULL, external.regressors = NULL, variance.targeting = FALSE),
    mean.model = list(armaOrder = c(0, 0), include.mean = FALSE, archm = FALSE),
    distribution.model = 'norm'
  )
  mle.fitSR  <- ugarchfit(mle.specSR, y_t[, n])
  list(as.numeric(sigma(mle.fitSR)), uncvariance(mle.fitSR))
})

Plots

We’ll make extensive use of plots, mainly based on Lattice (Sarkar 2008). For better organization, plot logic is consolidated into an auxiliary script.

source('R/plots.R')

Findings

We filter the daily log returns through a single-regime \(GARCH(1, 1)\) model with Gaussian density.

q <- queue[[1]]
tscsplot(q$getMat(), q$getMain())