This notebook is part of the material presented in R/Finance 2017. Please, see the README file.
\[ \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\EE}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\VV}[1]{\mathbb{V}\left[ #1 \right]} \newcommand{\RR}{\mathbb{R}} \newcommand{\II}{\mathbb{I}} \newcommand{\NN}{\mathbb{N}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bT}{\mathbf{T}} \newcommand{\btheta}{\mathbf{\theta}} \]
The aim of this notebook is twofold. First, I’d like to draw your attention to a small fact observed in financial asset prices: when log returns are filtered through a GARCH model with Markovian dynamics, the belief states (low/high volatility) are correlated across assets. It’s not a game changer, but hopefully it will trigger some new ideas to improve your trading strategy or risk model. Second, I’ll use this small fact as an excuse to introduce you to a few concepts that will certainly be of help to any financial analyst.
You’ll be exposed to the following concepts:
Worried by the broad scope? I’ll only explain the basics and leave many, many references. Hopefully, this will be enough exposure for you to decide if the topic is worth investing more effort.
Also known as volatility clustering, conditional heteroskedasticity is a key feature of economic time series, in particular of returns of financial assets. Relatively volatile periods, characterized by large price changes and large returns, alternate with calmer periods of stable prices and small returns (Franses, Dijk, and Opschoor 2014, 26).
Let \(P_t\) be the price of an asset at time \(t\) and \(r_t = \log(P_t) - \log(P_{t-1})\) be the log return. The expected return \(m_t\) is estimated by any model for the mean that leaves an unexplained component \(\epsilon_t\) called stochastic error or random shock:
\[ r_t = m_t + \epsilon_t. \] As will be seen later, the shock is related to the variance of the returns and may be modelled deterministically or stochastically. In the first case, the error is assumed to have the form \(\epsilon_t = v_t \sqrt{h_t}\), where \(v_t\) is a standardized random variable, i.e. independent and identically distributed with zero mean and unit variance. Note we’ve not assumed a Gaussian shape. Franses, Dijk, and Opschoor (2014, 170) show that the shocks have a constant unconditional variance equal to \(\sigma^2\) and a distribution conditional on past information \(\II_{t-1}\) with mean zero and variance \(h_t\). That is,
\[ \EE{\epsilon_t} = 0, \quad \VV{\epsilon_t} = \sigma^2, \quad \EE{\epsilon_t | \II_{t-1}} = 0, \quad \VV{\epsilon_t | \II_{t-1}} = h_t. \] Building upon Autoregressive Conditional Heteroskedasticity (ARCH), Bollerslev (1986) proposed the Generalized Autoregressive Conditional Heteroskedasticity (GARCH). The conditional variance is a deterministic function of its past values and past squared shocks:
\[ (1) \quad \quad h_t = \alpha_0 + \sum_{i=1}^{q}{\alpha_i \ \epsilon^2_{t-i}} + \sum_{j=1}^{p}{\beta_j \ h_{t-j}},\] with \(\alpha_0 > 0\), \(\alpha_i \ge 0\), \(\beta_j \ge 0\) and \(\sum_{i=1}^{\max(p, q)}{(\alpha_i + \beta_i)} < 1\) (where \(\alpha_i = 0 \ \forall \ i > q\) and \(\beta_i = 0 \ \forall \ i > p\)).
What does all this mean? Simply, that the volatility of returns, measured as the conditional second moment of the series, varies deterministically with time. Changes in conditional volatility are driven by past volatility and past shocks. The larger the last \(q\) shocks or the observed volatility in the last \(p\) periods, the more uncertain we are about the next return. It’s more uncertain to place a bet on today’s close price if the stock just moved far from its expected value. A fairly simple model indeed, but it makes sense.
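To make the recursion in (1) concrete, here is a minimal simulation sketch of a Gaussian \(GARCH(1, 1)\) path; the parameter values are purely illustrative, not estimates from any data.
# Minimal GARCH(1, 1) simulation following equation (1); parameter values are illustrative only
set.seed(123)
T.sim  <- 1000
alpha0 <- 0.05
alpha1 <- 0.10
beta1  <- 0.85                              # alpha1 + beta1 < 1
h      <- numeric(T.sim)
eps    <- numeric(T.sim)
h[1]   <- alpha0 / (1 - alpha1 - beta1)     # start at the unconditional variance
eps[1] <- rnorm(1, sd = sqrt(h[1]))
for (t in 2:T.sim) {
  h[t]   <- alpha0 + alpha1 * eps[t - 1]^2 + beta1 * h[t - 1]
  eps[t] <- rnorm(1, sd = sqrt(h[t]))       # eps_t = v_t * sqrt(h_t), with v_t ~ N(0, 1)
}
plot(eps, type = 'l', main = 'Simulated GARCH(1, 1) shocks: volatility clustering')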
By the way, the GARCH family is HUGE. There’s a vast literature that originated in academia during the ’90s and was fueled by extensive use in the industry during the ’00s. Many variations allow for more complex patterns like non-linearity, asymmetry and fatter tails: Integrated GARCH, GARCH-in-mean, Exponential GARCH, Threshold GARCH, GJR GARCH, Asymmetric Power GARCH, Student-t and Generalized Error distributions, among many others. See Franses, Dijk, and Opschoor (2014, 176) and the vignettes in Ghalanos (2015).
A random variable \(\epsilon\) is said to have a univariate finite normal mixture distribution if the unconditional density is given by
\[ MN(\epsilon) = \sum_{j=1}^{K}{\lambda_j \ N(\epsilon | \mu_j, \ \sigma^2_j)}, \] where \(K \in \NN\) is the number of components, \(N(\cdot \ | \ \mu, \ \sigma^2)\) is the Gaussian density with mean \(\mu\) and variance \(\sigma^2\), and \(\lambda_j\) are the mixing weights with \(\lambda_j > 0 \ \forall \ j\) and \(\sum_{j=1}^{K}{\lambda_j} = 1\) (Haas, Mittnik, and Paolella 2004).
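As a quick illustration, a two-component mixture with arbitrary weights and variances already produces the fat tails typical of financial returns, even before adding any dynamics. The numbers below are made up.
# Density of a two-component normal mixture (arbitrary illustrative parameters)
lambda <- c(0.9, 0.1)                   # mixing weights, sum to 1
sigma  <- c(1, 3)                       # low- and high-volatility components
dmixnorm <- function(x) lambda[1] * dnorm(x, sd = sigma[1]) +
                        lambda[2] * dnorm(x, sd = sigma[2])
curve(dmixnorm(x), from = -10, to = 10, ylab = 'density')
curve(dnorm(x, sd = sqrt(sum(lambda * sigma^2))), add = TRUE, lty = 2)  # Gaussian with matching variance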
A mixture of distributions is an old statistical trick to accommodate skewness and excess kurtosis. It has an extra feature in economics: the components are usually subject to simple yet useful interpretations. The variable of interest has a different mean and variance depending on which state the process is in. In our model, states may represent recession/expansion, bearish/bullish, low/high volatility, risk off/on, and so on. Of course, the choice isn’t necessarily binary. Keep reading to find out more!
If the time series \(\{\epsilon_t\}\) is generated by a \(K\)-component Mixture of Normals \(GARCH(p, q)\) process, then the conditional density of the shock is given by \[ f(\epsilon_t | \II_{t-1}) = MN(\lambda_1, \dots, \lambda_K, h_{1t}, \dots, h_{Kt}),\] where \(\II_{t-1}\) is the information set at time \(t-1\), \(\lambda_i \in (0, 1), \ i = 1, \dots, K\), and \(\sum_{i=1}^{K}{\lambda_i} = 1\). In the simplest case (diagonal symmetric), each possible state \(k = 1, \dots, K\) has its own conditional volatility governed by its own GARCH deterministic dynamics and its own state-dependent parameters:
\[ (2) \quad \quad h_{kt} = \alpha_{k0} + \sum_{i=1}^{q}{\alpha_{ki} \ \epsilon^2_{t-i}} + \sum_{j=1}^{p}{\beta_{kj} \ h_{k, \ t-j}}. \]
Now this is interesting. We only observe one series of log returns, which follows only one path of conditional volatility. Nonetheless, volatility at any given time \(t\) comes from one of the \(K\) different states or regimes, each one with its own level and dynamics. Regime switching allows for non-linearity in the model and allows our estimates to quickly adjust to changes in the market.
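Here is a minimal simulation sketch of a two-component Mixture of Normals \(GARCH(1, 1)\): at each step a component is drawn with constant weights, the shock is generated from that component’s conditional variance, and all \(K\) variance processes in (2) are then updated with the same observed shock. All parameter values are made up for illustration.
# Two-component mixture-of-normals GARCH(1, 1) simulation (illustrative parameters)
set.seed(456)
T.sim  <- 1000
lambda <- c(0.8, 0.2)                           # constant mixing weights
alpha0 <- c(0.02, 0.30)                         # state-dependent intercepts
alpha1 <- c(0.05, 0.15)
beta1  <- c(0.90, 0.80)
h      <- matrix(NA, T.sim, 2)                  # one conditional variance path per component
eps    <- numeric(T.sim)
h[1, ] <- alpha0 / (1 - alpha1 - beta1)
k      <- sample(1:2, 1, prob = lambda)
eps[1] <- rnorm(1, sd = sqrt(h[1, k]))
for (t in 2:T.sim) {
  h[t, ] <- alpha0 + alpha1 * eps[t - 1]^2 + beta1 * h[t - 1, ]  # equation (2) for all components
  k      <- sample(1:2, 1, prob = lambda)                        # component drawn with constant weights
  eps[t] <- rnorm(1, sd = sqrt(h[t, k]))
}
matplot(h, type = 'l', lty = 1, ylab = 'conditional variance per component')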
The Mixture of Normals GARCH proposed above involves only \(K\) weights \(\lambda_1, \dots, \lambda_K\) that are constant over time. To turn it into a Hidden Markov (Markov Switching) GARCH, we further assume that the \(K\) discrete regimes follow a first-order Markov process governed by transition probabilities \(\bp\). Since we don’t observe the states directly, we say they are hidden and we estimate a probability distribution over the possible states. Remember that there is one set of parameters for the GARCH equation in each state.
Let \(z_{t}\) be the discrete component, state or regime at time point \(t\). The state dynamics are governed by the transition matrix \(\mat{\Psi}\) of size \(K \times K\) and elements \(\Psi(i, j) = p(z_{t} = j \ | \ z_{t-1} = i)\) with \(0 \le \Psi(i, j) \le 1\) and \(\sum_{j}{\Psi(i, j)} = 1\). The matrix has \(K(K-1)\) independent parameters. The initial latent node \(z_{1}\) doesn’t have a parent node and has a marginal distribution \(p(z_1)\).
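For \(K = 2\), an illustrative (made-up) transition matrix with persistent regimes looks as follows; each row sums to one and the matrix has \(K(K-1) = 2\) free parameters.
# Illustrative 2-state transition matrix: Psi[i, j] = p(z_t = j | z_{t-1} = i)
Psi <- matrix(c(0.95, 0.05,    # staying in the low-volatility state is very likely
                0.10, 0.90),   # the high-volatility state is slightly less persistent
              nrow = 2, byrow = TRUE)
rowSums(Psi)                   # each row sums to 1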
The specification of the probabilistic model is completed by the conditional distributions of the observed variables \(p(\epsilon | z_t, \btheta)\), where \(\btheta\) is the set of parameters of this distribution. Usually known as the emission probabilities in machine learning jargon, the elements of this \(K\)-sized vector represent the probability that a given observation of the volatility corresponds to each of the \(K\) possible states.
The forward algorithm is used to estimate the belief state \(p(z_{t} | \II_{t})\), the probability of the state at time \(t\) given the information available up to that moment. Estimating this \(K\)-sized vector is known as filtering. When combined with the backward algorithm, you get the smoothed probability \(p(z_{t} | \II_{T})\).
A filter infers the belief state using all the information available up to that point, \(p(z_t | \mat{y}_{1:t})\). It achieves better noise reduction than estimating the hidden state from the current observation alone. The filtering process can be run online, or recursively, as new data streams in.
Filtered marginals can be computed recursively by means of the forward algorithm (Baum and Eagon 1967). Let \(\psi_t(j) = p(\mat{y}_t | z_t = j)\) be the local evidence at time \(t\) and \(\Psi(i, j) = p(z_t = j | z_{t-1} = i)\) be the transition probability. First, the one-step-ahead predictive density is computed
\[ p(z_t = j | \mat{y}_{1:t-1}) = \sum_{i}{\Psi(i, j) p(z_{t-1} = i | \mat{y}_{1:t-1})}. \]
Acting as prior information, this quantity is updated with the data observed at point \(t\) using Bayes’ rule, \[\begin{align*} \label{eq:filtered-belief_state} \alpha_t(j) & \triangleq p(z_t = j | \mat{y}_{1:t}) \\ &= p(z_t = j | \mat{y}_{t}, \mat{y}_{1:t-1}) \\ &= Z_t^{-1} \psi_t(j) \, p(z_t = j | \mat{y}_{1:t-1}) \\ &= Z_t^{-1} \psi_t(j) \sum_{i}{\Psi(i, j) \, \alpha_{t-1}(i)}, \end{align*}\]where the normalization constant is given by
\[ Z_t \triangleq p(\mat{y}_t | \mat{y}_{1:t-1}) = \sum_{l}{p(\mat{y}_{t} | z_t = l) \, p(z_t = l | \mat{y}_{1:t-1})}. \]
This predict-update cycle results in the filtered belief states at point \(t\). As the algorithm only requires the evaluation of the quantities \(\psi_t(j)\) for each value of \(z_t\) for every \(t\) and fixed \(\mat{y}_t\), the posterior distribution of the latent states is independent of the form of the observation density, and indeed of whether the observed variables are continuous or discrete (Jordan 2003).
Let \(\mat{\alpha}_t\) be the \(K\)-sized vector of filtered belief states at point \(t\), \(\mat{\psi}_t\) be the \(K\)-sized vector of local evidence at point \(t\), \(\mat{\Psi}\) be the transition matrix, and let \(\mat{u} \odot \mat{v}\) denote the Hadamard product, i.e. elementwise vector multiplication. Then, the Bayesian updating procedure can be expressed in matrix notation as
\[ \mat{\alpha}_t \propto \mat{\psi}_t \odot (\mat{\Psi}^T \mat{\alpha}_{t-1}). \]
In addition to computing the hidden states, the algorithm yields the log probability of the evidence
\[ \log p(\mat{y}_{1:T} | \mat{\theta}) = \sum_{t=1}^{T}{\log p(\mat{y}_{t} | \mat{y}_{1:t-1})} = \sum_{t=1}^{T}{\log Z_t}. \]
Computations should be carried out in the log domain to avoid numerical underflow.
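The whole predict-update cycle fits in a few lines of R. This is a generic sketch of the forward filter in the matrix notation above, not the Stan implementation used later; it assumes a \(T \times K\) matrix lik of local evidence \(\psi_t(j)\), a transition matrix Psi (such as the 2-state example above) and an initial distribution pi1 are given.
# Forward filter: alpha_t proportional to psi_t * (t(Psi) %*% alpha_{t-1}), with log evidence.
# Assumed inputs: lik (T x K local evidence), Psi (K x K transition matrix), pi1 (initial distribution).
forward_filter <- function(lik, Psi, pi1) {
  T.len <- nrow(lik); K <- ncol(lik)
  alpha <- matrix(NA, T.len, K)
  logZ  <- numeric(T.len)
  a          <- lik[1, ] * pi1                          # update the initial state distribution
  logZ[1]    <- log(sum(a))
  alpha[1, ] <- a / sum(a)
  for (t in 2:T.len) {
    pred       <- as.vector(t(Psi) %*% alpha[t - 1, ])  # predict: p(z_t = j | y_{1:t-1})
    a          <- lik[t, ] * pred                       # update with the local evidence psi_t
    logZ[t]    <- log(sum(a))
    alpha[t, ] <- a / sum(a)                            # normalized filtered beliefs alpha_t
  }
  list(alpha = alpha, loglik = sum(logZ))               # log p(y_{1:T}) = sum_t log Z_t
}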
Let’s set math and computations aside for a while and focus on finance: what can we make out of a model like this?
Before we get moving, make sure you’ve downloaded all the auxiliary files to the same folder. We first check that all the required packages for this notebook are available.
packages.req <- c('quantmod', 'rugarch', 'rstan', 'shinystan', 'lattice', 'gridExtra', 'gridBase', 'HH')
packages.mis <- packages.req[!(packages.req %in% rownames(installed.packages()))]
if (length(packages.mis)) {
  stop(paste('Please, first install the following packages:',
             paste(packages.mis, collapse = ', ')))
  # We could write a script to install them directly, but this may mess up other code of yours
}
We choose \(N = 5\) series from 2011-01-01 to 2016-12-31 (\(T = 1502\)): ^GSPC, F, GM, THO and AIR. The data is downloaded and pre-processed using quantmod (Ryan, Ulrich, and Thielen 2016).
library(quantmod)
syms <- c('^GSPC', 'F', 'GM', 'THO', 'AIR')
usecache <- TRUE
cache.filename <- 'data/symsret.RDS'
if (usecache && file.exists(cache.filename)) {
  y_t <- readRDS(cache.filename)
} else {
  # Download daily prices from Yahoo Finance and compute log returns (in percentage points)
  y_t <- as.matrix(
    do.call(cbind, lapply(syms, function(s) {
      p <- getSymbols(s, src = "yahoo", from = "2011-01-01", to = "2016-12-31", auto.assign = FALSE)
      r <- periodReturn(p, period = 'daily', type = 'log')
      colnames(r) <- s
      r * 100
    })))
  saveRDS(y_t, file = cache.filename)  # write the cache for the next run
}
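A quick sanity check confirms the dimensions stated above (\(T = 1502\) rows, \(N = 5\) columns):
dim(y_t)                 # should be T = 1502 rows and N = 5 columns
head(round(y_t, 4))      # first few daily log returns, in percentage points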
The (very) naive implementation of the algorithm in Stan is only meant for illustration. A few good practices were neglected, convergence is not guaranteed, there is much room left for optimization, and fitting \(N\) different independent models is probably not a reasonable choice for a production sampler. The main takeaway of this presentation is the ideas behind the code, not the code itself.
We’ll load Stan (2016) and run the model for \(K = 2\) states. Since the sampler may take a few minutes, I’ve built a very basic caching feature.
library(rstan)
rstan_options(auto_write = TRUE) # Writes down the compiled sampler
options(mc.cores = parallel::detectCores()) # Use all available cores
K = 2 # Number of states
bmodel = 'stan/hmmgarch.stan' # Name of the model file
standata = list(
N = ncol(y_t),
T = nrow(y_t),
K = K,
y = t(y_t)
)
usecache = TRUE
cache.filename <- paste0('stan/hmmgarch-cache.rds')
if (usecache && file.exists(cache.filename)) {
stan.fit <- readRDS(cache.filename)
} else {
stan.fit <- stan(file = bmodel,
model_name = "Bayesian Hierarchical Mixture GARCH",
data = standata, verbose = T,
iter = 200, warmup = 100, thin = 1, chains = 4, cores = 4,
control = list(adapt_delta = 0.80))
saveRDS(stan.fit, file = cache.filename)
}
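Once the fit is available, either freshly sampled or read from the cache, it is good practice to glance at the usual convergence diagnostics; since the parameter names depend on the Stan file, only the generic summary is shown here.
# Rhat and effective sample sizes for all parameters in the fitted model
print(stan.fit)
# For interactive diagnostics (shinystan is among the required packages):
# shinystan::launch_shinystan(stan.fit)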
We’ll also fit a single regime GARCH model with rugarch (Ghalanos 2015). Sidenote: I’ve used rugarch considerably and it has proved to be very reliable; it’s a fine piece of software engineering.
library(rugarch)
SR <- lapply(1:ncol(y_t), function(n) {
  # Standard GARCH(1, 1) with zero mean and Gaussian innovations
  mle.specSR <- ugarchspec(
    variance.model = list(model = "sGARCH", garchOrder = c(1, 1),
                          submodel = NULL, external.regressors = NULL, variance.targeting = FALSE),
    mean.model = list(armaOrder = c(0, 0), include.mean = FALSE, archm = FALSE),
    distribution.model = 'norm'
  )
  mle.fitSR <- ugarchfit(mle.specSR, y_t[, n])
  # Keep the fitted conditional volatility path and the unconditional variance
  list(as.numeric(sigma(mle.fitSR)), uncvariance(mle.fitSR))
})
We’ll make extensive use of plots, mainly based on Lattice (Sarkar 2008). For better organization, plot logic is consolidated into an auxiliary script.
source('R/plots.R')
We filter the daily log returns through a single regime \(GARCH(1, 1)\) model with Gaussian density.
q <- queue[[1]]
tscsplot(q$getMat(), q$getMain())